[jira] [Commented] (CASSANDRA-10765) add RangeIterator interface and QueryPlan for SI
[ https://issues.apache.org/jira/browse/CASSANDRA-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378193#comment-16378193 ] Corentin Chary commented on CASSANDRA-10765:

Note: after having troubles like this with SASI, we ended up moving to https://github.com/Stratio/stratio-cassandra. IMHO leveraging Lucene instead of building yet another index makes much more sense. It would be great to see SASI use Lucene internally (even if it's somewhat against the current design). Before using Stratio we started experimenting with a SASI-like Lucene-enabled index, see https://github.com/criteo/biggraphite/tree/master/tools/graphiteindex

> add RangeIterator interface and QueryPlan for SI
>
> Key: CASSANDRA-10765
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10765
> Project: Cassandra
> Issue Type: Sub-task
> Components: Local Write-Read Paths
> Reporter: Pavel Yaskevich
> Assignee: Pavel Yaskevich
> Priority: Major
> Labels: 2i, sasi
> Fix For: 4.x
> Attachments: server-load.png
>
> Currently built-in indexes have only one way of handling intersections/unions: pick the highest-selectivity predicate and filter on the other index expressions. This is not always the most efficient approach; dynamic query planning based on the different index characteristics would be more optimal. The Query Plan should be able to choose how to do intersections and unions based on the metadata provided by the indexes (returned by RangeIterator), and RangeIterator would become the base for cross-index interactions and should carry information such as min/max token, estimated number of wrapped tokens, etc.
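For context, a minimal sketch of the kind of interface the ticket describes. The method names and the probe heuristic below are illustrative assumptions; only the concept (min/max token and an estimated token count driving the planner) comes from the ticket:

{code}
// Illustrative sketch only: method names are assumptions, not Cassandra's actual API.
public interface RangeIterator<K extends Comparable<K>, V> extends java.util.Iterator<V>
{
    K getMinimum();   // smallest token this iterator can produce
    K getMaximum();   // largest token this iterator can produce
    long getCount();  // estimated number of wrapped tokens

    // A planner could use the metadata like this: if one iterator is far more
    // selective than the other, intersect by probing; otherwise merge both.
    static boolean shouldProbe(RangeIterator<?, ?> a, RangeIterator<?, ?> b)
    {
        return a.getCount() * 100 < b.getCount();
    }
}
{code}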
[jira] [Commented] (CASSANDRA-13189) Use prompt_toolkit in cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274190#comment-16274190 ] Corentin Chary commented on CASSANDRA-13189:

I'm trying to go forward with this:
* What would be missing from the current patch?
* What's the recommended way to run the cqlsh unit tests?

Thanks!

> Use prompt_toolkit in cqlsh
>
> Key: CASSANDRA-13189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13189
> Project: Cassandra
> Issue Type: New Feature
> Components: Tools
> Reporter: Corentin Chary
> Assignee: Corentin Chary
> Priority: Minor
> Attachments: cqlsh-prompt-tookit.png
>
> prompt_toolkit is an alternative to readline (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a lot of software, including the upcoming version of ipython.
> I'm working on an initial version that keeps compatibility with readline, which is available here: https://github.com/iksaif/cassandra/tree/prompt_toolkit
> It's still missing tests and a few things, but I'm opening this for tracking and feedback.
> !cqlsh-prompt-tookit.png|thumbnail!
[jira] [Commented] (CASSANDRA-13215) Cassandra nodes startup time 20x more after upgrading to 3.x
[ https://issues.apache.org/jira/browse/CASSANDRA-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268362#comment-16268362 ] Corentin Chary commented on CASSANDRA-13215:

[~krummas] I tried it on our test cluster and it seems to work great: startup time was divided by 3-4. I expect it to have an even greater impact on prod (more nodes, more sstables).

> Cassandra nodes startup time 20x more after upgrading to 3.x
>
> Key: CASSANDRA-13215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13215
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Cluster setup: two datacenters (dc-main, dc-backup).
> dc-main - 9 servers, no vnodes
> dc-backup - 6 servers, vnodes
> Reporter: Viktor Kuzmin
> Assignee: Marcus Eriksson
> Fix For: 3.11.2, 4.0
> Attachments: simple-cache.patch
>
> CompactionStrategyManager.getCompactionStrategyIndex is called on each sstable at startup, and this function calls StorageService.getDiskBoundaries, which in turn calls AbstractReplicationStrategy.getAddressRanges.
> It appears that the last function can be really slow. In our environment we have 1545 tokens, and with NetworkTopologyStrategy it can make 1545*1545 computations in the worst case (maybe I'm wrong, but it really takes lots of CPU).
> This function can also affect runtime later, because it is called not only during startup.
> I've tried to implement a simple cache for getDiskBoundaries results and now startup time is about one minute instead of 25m, but I'm not sure if it's a good solution.
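For illustration, a minimal sketch of the caching idea behind the attached simple-cache.patch, assuming a hypothetical ring-version counter for invalidation. All class and method names here are invented for the example, not taken from the patch:

{code}
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class DiskBoundaryCache
{
    private static final class Entry
    {
        final long ringVersion;
        final List<Long> boundaries;
        Entry(long ringVersion, List<Long> boundaries)
        {
            this.ringVersion = ringVersion;
            this.boundaries = boundaries;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

    // Recompute only when the ring version changed; otherwise reuse the cached
    // boundaries instead of re-running the expensive getAddressRanges()-style call
    // that dominated startup.
    List<Long> get(String table, long ringVersion, Supplier<List<Long>> compute)
    {
        Entry e = cache.get(table);
        if (e == null || e.ringVersion != ringVersion)
        {
            e = new Entry(ringVersion, compute.get());
            cache.put(table, e);
        }
        return e.boundaries;
    }
}
{code}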
[jira] [Commented] (CASSANDRA-13677) Make SASI timeouts easier to debug
[ https://issues.apache.org/jira/browse/CASSANDRA-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263997#comment-16263997 ] Corentin Chary commented on CASSANDRA-13677:

Thanks! :)

> Make SASI timeouts easier to debug
>
> Key: CASSANDRA-13677
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13677
> Project: Cassandra
> Issue Type: Improvement
> Components: sasi
> Reporter: Corentin Chary
> Assignee: Corentin Chary
> Priority: Minor
> Fix For: 4.0
> Attachments: 0001-SASI-Make-timeouts-easier-to-debug.patch
>
> This would now give something like:
> {code}
> WARN [ReadStage-15] 2017-06-08 12:47:57,799 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-15,5,main]: {}
> java.lang.RuntimeException: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: Command 'Read(biggraphite_metadata.directories columns=* rowfilter=component_0 = criteo limits=LIMIT 5000 range=(min(-9223372036854775808), min(-9223372036854775808)] pfilter=names(EMPTY))' took too long (100 > 100ms).
>     at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2591) ~[main/:na]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
>     at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[main/:na]
>     at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [main/:na]
>     at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [main/:na]
>     at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
> Caused by: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: Command 'Read(biggraphite_metadata.directories columns=* rowfilter=component_0 = criteo limits=LIMIT 5000 range=(min(-9223372036854775808), min(-9223372036854775808)] pfilter=names(EMPTY))' took too long (100 > 100ms).
>     at org.apache.cassandra.index.sasi.plan.QueryController.checkpoint(QueryController.java:163) ~[main/:na]
>     at org.apache.cassandra.index.sasi.plan.QueryController.getPartition(QueryController.java:117) ~[main/:na]
>     at org.apache.cassandra.index.sasi.plan.QueryPlan$ResultIterator.computeNext(QueryPlan.java:116) ~[main/:na]
>     at org.apache.cassandra.index.sasi.plan.QueryPlan$ResultIterator.computeNext(QueryPlan.java:71) ~[main/:na]
>     at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[main/:na]
>     at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:92) ~[main/:na]
>     at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:310) ~[main/:na]
>     at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:145) ~[main/:na]
>     at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:138) ~[main/:na]
>     at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:134) ~[main/:na]
>     at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:76) ~[main/:na]
>     at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:333) ~[main/:na]
>     at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1884) ~[main/:na]
>     at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2587) ~[main/:na]
>     ... 5 common frames omitted
> {code}
> Not having the query makes it super hard to debug. Even worse, because the query potentially stops before the slow_query threshold, it won't show up as a slow query either.
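For readers following along, a minimal self-contained sketch of the idea behind the patch: include the command in the quota-exceeded exception so the offending query shows up in the log. The class and field names below are illustrative; the real change lives in SASI's QueryController.checkpoint():

{code}
final class QueryQuota
{
    private final long startNanos = System.nanoTime();
    private final long quotaMs;
    private final Object command; // the ReadCommand; its toString() carries the query

    QueryQuota(Object command, long quotaMs)
    {
        this.command = command;
        this.quotaMs = quotaMs;
    }

    // Called periodically while the index query runs; including the command in
    // the message is what makes the timed-out query visible in the log.
    void checkpoint()
    {
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        if (elapsedMs > quotaMs)
            throw new RuntimeException(String.format(
                "Command '%s' took too long (%d > %dms).", command, elapsedMs, quotaMs));
    }
}
{code}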
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263992#comment-16263992 ] Corentin Chary commented on CASSANDRA-13651:

Ping?

> Large amount of CPU used by epoll_wait(.., .., .., 0)
>
> Key: CASSANDRA-13651
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
> Project: Cassandra
> Issue Type: Bug
> Reporter: Corentin Chary
> Assignee: Corentin Chary
> Fix For: 4.x
> Attachments: cpu-usage.png
>
> I was trying to profile Cassandra under my workload and I kept seeing this backtrace:
> {code}
> epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
> io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
> io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
> io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
> io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
> java.lang.Thread.run() Thread.java:745
> {code}
> At first I thought that the profiler might not be able to profile native code properly, but I went further and realized that most of the CPU was used by {{epoll_wait()}} calls with a timeout of zero.
> Here is the output of perf on this system, which confirms that most of the overhead was with timeout == 0:
> {code}
> Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
> Overhead  Trace output
>  90.06%   epfd: 0x0047, events: 0x7f5588c0c000, maxevents: 0x2000, timeout: 0x
>   5.77%   epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x
>   1.98%   epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x03e8
>   0.04%   epfd: 0x0003, events: 0x2f6af77b9c00, maxevents: 0x0020, timeout: 0x
>   0.04%   epfd: 0x002b, events: 0x121ebf63ac00, maxevents: 0x0040, timeout: 0x
>   0.03%   epfd: 0x0026, events: 0x7f51f80019c0, maxevents: 0x0020, timeout: 0x
>   0.02%   epfd: 0x0003, events: 0x7fe4d80019d0, maxevents: 0x0020, timeout: 0x
> {code}
> Running this time with perf record -ag for call traces:
> {code}
> # Children   Self   sys    usr   Trace output
>   8.61%     8.61%  0.00%  8.61%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x
>             |
>             ---0x1000200af313
>                |
>                --8.61%--0x7fca6117bdac
>                          0x7fca60459804
>                          epoll_wait
>   2.98%     2.98%  0.00%  2.98%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x03e8
>             |
>             ---0x1000200af313
>                0x7fca6117b830
>                0x7fca60459804
>                epoll_wait
> {code}
> That looks like a lot of CPU used to wait for nothing. I'm not sure if perf reports a per-CPU percentage or a per-system percentage, but that would still be 10% of the total CPU usage of Cassandra at the minimum.
> I went further and found the code behind all that: we schedule a lot of {{Message::Flusher}} tasks with a deadline of 10 usec (5 per message I think), but netty+epoll only supports timeouts of a millisecond or more and will convert everything below that to 0.
> I added some traces to netty
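The truncation described above can be shown with a few lines of arithmetic. This is a simplified illustration, not netty's actual code:

{code}
final class EpollTimeoutDemo
{
    public static void main(String[] args)
    {
        long flusherDeadlineNanos = 10_000; // the ~10 usec Flusher deadline
        // epoll_wait() takes its timeout in milliseconds; integer division
        // truncates anything below 1ms to 0, so instead of sleeping, the event
        // loop calls epoll_wait(.., .., .., 0), returns immediately and spins.
        int epollTimeoutMs = (int) (flusherDeadlineNanos / 1_000_000L);
        System.out.println(epollTimeoutMs); // prints 0
    }
}
{code}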
[jira] [Commented] (CASSANDRA-13215) Cassandra nodes startup time 20x more after upgrading to 3.x
[ https://issues.apache.org/jira/browse/CASSANDRA-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201558#comment-16201558 ] Corentin Chary commented on CASSANDRA-13215:

I'll try to test that in our test env in the next few days :)

> Cassandra nodes startup time 20x more after upgrading to 3.x
[jira] [Commented] (CASSANDRA-13215) Cassandra nodes startup time 20x more after upgrading to 3.x
[ https://issues.apache.org/jira/browse/CASSANDRA-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173029#comment-16173029 ] Corentin Chary commented on CASSANDRA-13215:

Cool, I will be happy to test it and report the performance improvements (mostly during startup).

> Cassandra nodes startup time 20x more after upgrading to 3.x
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171291#comment-16171291 ] Corentin Chary commented on CASSANDRA-13651:

Thanks for double-checking that, I pushed the wrong version. This should now be fixed.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169988#comment-16169988 ] Corentin Chary commented on CASSANDRA-13651:

Results: the calls to timerfd end up costing almost as much as the epoll_wait() calls did before. It is still more efficient to execute the task directly instead of scheduling it. I added a patch to bump netty to 4.1.15, and a much simpler version of my previous patch that allows one to configure the flush task delay.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
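A sketch of what such a configurable delay could look like. The netty_flush_delay_nanoseconds property name comes from this discussion, but the default value and the wiring into the flusher are assumptions for illustration:

{code}
import java.util.concurrent.TimeUnit;
import io.netty.channel.EventLoop;

final class FlushScheduling
{
    // Assumed default of 10 usec, matching the deadline described in the issue.
    static final long FLUSH_DELAY_NANOS = Long.getLong("netty_flush_delay_nanoseconds", 10_000L);

    static void scheduleFlush(EventLoop loop, Runnable flusher)
    {
        if (FLUSH_DELAY_NANOS == 0)
            loop.execute(flusher); // run eagerly: no timer, no zero-timeout epoll_wait() spin
        else
            loop.schedule(flusher, FLUSH_DELAY_NANOS, TimeUnit.NANOSECONDS);
    }
}
{code}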
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167399#comment-16167399 ] Corentin Chary commented on CASSANDRA-13651:

Ok, here is the plan (for myself):
* Separate out the "use only one worker group" patch. It's useful because it creates fewer threads, but isn't directly related to this issue.
* Update netty to 4.1.15 on our setup (without -Dnetty_flush_delay_nanoseconds) and see the effects.
* Set -Dnetty_flush_delay_nanoseconds and see the effects. Depending on the results, propose a simpler version of the patch.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166200#comment-16166200 ] Corentin Chary commented on CASSANDRA-13651:

Ping: anything against https://github.com/iksaif/cassandra/commits/cassandra-13651-trunk? If not I'll send a proper pull request.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Commented] (CASSANDRA-10496) Make DTCS/TWCS split partitions based on time during compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153196#comment-16153196 ] Corentin Chary commented on CASSANDRA-10496:

* Splitting partitions is non-trivial, so I left that for later.
* What do you mean by "changing locations isn't supported"? https://github.com/apache/cassandra/pull/147/files#diff-be1bfb81c770dcfb7042335b699c4cc3R112 seems to work.
* Currently it will create up to "minThreshold" sstables: https://github.com/apache/cassandra/pull/147/files#diff-e83635b2fb3079d9b91b039c605c15daR303
* Yes, getBuckets() currently uses maxTimestamp, which isn't available (currently) in the compaction task. Thus my question: what about using minTimestamp (makes sense for reads) or minLocalDeletionTime (makes sense for deletes/TTL)? (not thinking about upgrades at the moment)
* Are you talking about sstables generated before this patch?

> Make DTCS/TWCS split partitions based on time during compaction
>
> Key: CASSANDRA-10496
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10496
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Marcus Eriksson
> Labels: dtcs
> Fix For: 4.x
>
> To avoid getting old data in new time windows with DTCS (or related strategies, like [TWCS|CASSANDRA-9666]), we need to split old data out into its own sstable during compaction.
> My initial idea is to just create two sstables: when we create the compaction task we state the start and end times for the window, and any data older than the window will be put in its own sstable.
> By creating a single sstable with old data, we will incrementally get the windows correct. Say we have an sstable with these timestamps:
> {{[100, 99, 98, 97, 75, 50, 10]}}
> and we are compacting in window {{[100, 80]}} - we would create two sstables: {{[100, 99, 98, 97]}} and {{[75, 50, 10]}}, and the first window is now 'correct'. The next compaction would compact in window {{[80, 60]}} and create sstables {{[75]}}, {{[50, 10]}}, etc.
> We will probably also want to base the windows on the newest data in the sstables so that we actually have older data than the window.
[jira] [Commented] (CASSANDRA-10496) Make DTCS/TWCS split partitions based on time during compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147182#comment-16147182 ] Corentin Chary commented on CASSANDRA-10496:

So, I added https://github.com/apache/cassandra/pull/147, which seems to work for a few values. I need some feedback to go forward. In SplittingTimeWindowCompactionWriter I use minTimestamp to group values:
* minLocalDeletionTime would make more sense if we want to optimize for deletions.
* getBuckets() uses maxTimestamp, which is not available in the metadata stats (I'm unsure of the effects of changing to minTimestamp or minLocalDeletionTime in getBuckets()).

With this value, running nodetool compact --split-output also works :)

> Make DTCS/TWCS split partitions based on time during compaction
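To make the grouping concrete, a small sketch of timestamp-window bucketing as a splitting writer might do it. The window arithmetic is standard TWCS-style bucketing; the class and method names, and the use of microsecond timestamps, are illustrative assumptions:

{code}
import java.util.concurrent.TimeUnit;

final class WindowBucketing
{
    // Lower bound of the time window that a cell timestamp (in microseconds,
    // as Cassandra writes them by default) falls into.
    static long windowStartMicros(long timestampMicros, long windowSize, TimeUnit windowUnit)
    {
        long windowMicros = windowUnit.toMicros(windowSize);
        return (timestampMicros / windowMicros) * windowMicros;
    }

    public static void main(String[] args)
    {
        long ts = 1_504_224_000_000_000L; // 2017-09-01 00:00:00 UTC in usec
        // A splitting writer would route each row to the output sstable whose
        // window contains the row's minTimestamp, keeping old data out of new windows.
        System.out.println(windowStartMicros(ts, 1, TimeUnit.DAYS));
    }
}
{code}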
[jira] [Commented] (CASSANDRA-13215) Cassandra nodes startup time 20x more after upgrading to 3.x
[ https://issues.apache.org/jira/browse/CASSANDRA-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145453#comment-16145453 ] Corentin Chary commented on CASSANDRA-13215:

I can confirm that this is affecting us too (startup and repairs). [~krummas], did you end up doing something for this issue? Otherwise I might give it a shot.

> Cassandra nodes startup time 20x more after upgrading to 3.x
[jira] [Commented] (CASSANDRA-13743) CAPTURE not easily usable with PAGING
[ https://issues.apache.org/jira/browse/CASSANDRA-13743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143471#comment-16143471 ] Corentin Chary commented on CASSANDRA-13743:

Thanks for merging it and fixing it :)

> CAPTURE not easily usable with PAGING
>
> Key: CASSANDRA-13743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13743
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: Corentin Chary
> Assignee: Corentin Chary
> Fix For: 4.0
>
> See https://github.com/iksaif/cassandra/commit/7ed56966a7150ced44c375af307685517d7e09a3 for a patch fixing that.
[jira] [Commented] (CASSANDRA-10496) Make DTCS/TWCS split partitions based on time during compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143470#comment-16143470 ] Corentin Chary commented on CASSANDRA-10496:

I had a patch that would minimize the number of compactions while trying to strictly respect the time windows (and would also make major compactions split the sstables correctly). I need to finish it and will try to find time this month.

> Make DTCS/TWCS split partitions based on time during compaction
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143466#comment-16143466 ] Corentin Chary commented on CASSANDRA-13651:

Great! I'm good with either bumping netty to this version or merging my patch. [~jjirsa], what do you think?

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Created] (CASSANDRA-13743) CAPTURE not easily usable with PAGING
Corentin Chary created CASSANDRA-13743:

Summary: CAPTURE not easily usable with PAGING
Key: CASSANDRA-13743
URL: https://issues.apache.org/jira/browse/CASSANDRA-13743
Project: Cassandra
Issue Type: Bug
Components: Tools
Reporter: Corentin Chary
Fix For: 4.x

See https://github.com/iksaif/cassandra/commit/7ed56966a7150ced44c375af307685517d7e09a3 for a patch fixing that.
[jira] [Updated] (CASSANDRA-13743) CAPTURE not easily usable with PAGING
[ https://issues.apache.org/jira/browse/CASSANDRA-13743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13743:

Status: Patch Available (was: Open)

> CAPTURE not easily usable with PAGING
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112292#comment-16112292 ] Corentin Chary commented on CASSANDRA-13651:

Using timerfd is something that I looked at, but I thought it would be easier to just change the code in Cassandra for now. I'll be out for the next three weeks but I'll definitely try a patched version of netty when I'm back.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111060#comment-16111060 ] Corentin Chary commented on CASSANDRA-13651:

https://github.com/iksaif/cassandra/commit/c05f2eef6abc8066b69e50dc5025f17e17871f0c should fix the test. I'm running with 4.0.44. The production test is using 3.11 as a base, but I'm able to start trunk on my dev machine.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110416#comment-16110416 ] Corentin Chary commented on CASSANDRA-13651:

[~jjirsa]: The cassandra-stress report is for https://github.com/criteo/biggraphite/blob/master/tools/stress/biggraphite.yaml. The bottleneck here was the lack of parallelization on the client, I guess. The screenshot is the actual workload of BigGraphite: 3 nodes, 100 connected clients.

[~norman]: A timeout of 1ms or no timeout at all would achieve the same thing, I guess. I tested with no timeout as it implied fewer changes to the actual logic (see https://github.com/iksaif/cassandra/tree/cassandra-13651-trunk). Currently (I believe) the message itself is written after the delay, so increasing the timeout would increase the latency of every operation. In my test I simply disable the timeout and schedule the flush task as soon as I can; this doesn't reduce the opportunities for batching that much if you keep the number of epoll threads low.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Comment Edited] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109278#comment-16109278 ] Corentin Chary edited comment on CASSANDRA-13651 at 8/1/17 4:53 PM:

!cpu-usage.png|thumbnail!
Almost 8% of CPU saved after updating all three nodes.

was (Author: iksaif):
!cpu-usage.png|thumbnail!

> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Comment Edited] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109278#comment-16109278 ] Corentin Chary edited comment on CASSANDRA-13651 at 8/1/17 4:53 PM: !cpu-usage.png! ~8% of CPU saved after updating all three nodes. was (Author: iksaif): !cpu-usage.png|thumbnail! ~8% of CPU saved after updating all three nodes.
> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Updated] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13651: --- Attachment: cpu-usage.png !cpu-usage.png|thumbnail!
> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Updated] (CASSANDRA-13647) cassandra-test: URI is not absolute
[ https://issues.apache.org/jira/browse/CASSANDRA-13647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13647: --- Priority: Minor (was: Major)
> cassandra-test: URI is not absolute
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103168#comment-16103168 ] Corentin Chary commented on CASSANDRA-13432: [~rgerard] this is already part of 3.x, this only applies to 2.x. > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug > Environment: cassandra 2.1.15 >Reporter: Corentin Chary > Fix For: 2.1.x > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 > 0 > RequestResponseStage 0 0 86090931 0 > 0 > ReadRepairStage 0 0 166645 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 47 0 > 0 > GossipStage 0 0 188769 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor0 0 86835 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0 92 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0563 0 > 0 > MemtablePostFlush 0 0 1500 0 > 0 > MemtableReclaimMemory 129534 0 > 0 > Native-Transport-Requests41 0 54819182 0 > 1896 > {code} > {code} > "MemtableReclaimMemory:195" - Thread t@6268 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) > at > org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) > at > org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "SharedPool-Worker-195" - Thread t@989 >java.lang.Thread.State: RUNNABLE > at > org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) > at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) > at > org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) > at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) > at > org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099971#comment-16099971 ] Corentin Chary commented on CASSANDRA-13651: Latest patch (https://github.com/iksaif/cassandra/tree/cassandra-13651-trunk) tested with an actual workload; I attached the screenshot. Looks like we can get ~2% of CPU back with -Dcassandra.netty_flush_delay_nanoseconds=0 and some more with -Dio.netty.eventLoopThreads=6. This does not seem to affect latency. !Screenshot (5).png|thumbnail!
> Large amount of CPU used by epoll_wait(.., .., .., 0)
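For reference, a sketch of how a flag like {{cassandra.netty_flush_delay_nanoseconds}} (the property name comes from the branch above) could gate the two behaviours; the surrounding code is illustrative, not the actual patch:
{code}
import java.util.concurrent.TimeUnit;
import io.netty.channel.EventLoop;

// Illustrative only: choose between delayed and immediate flushing based
// on the -Dcassandra.netty_flush_delay_nanoseconds flag mentioned above.
final class FlushScheduling
{
    static final long FLUSH_DELAY_NANOS =
        Long.getLong("cassandra.netty_flush_delay_nanoseconds", 10_000L); // 10 usec default

    static void scheduleFlush(EventLoop loop, Runnable flushTask)
    {
        if (FLUSH_DELAY_NANOS > 0)
            // anything below 1ms ends up as epoll_wait(..., 0) on Linux
            loop.schedule(flushTask, FLUSH_DELAY_NANOS, TimeUnit.NANOSECONDS);
        else
            loop.execute(flushTask);
    }
}
{code}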
[jira] [Updated] (CASSANDRA-13677) Make SASI timeouts easier to debug
[ https://issues.apache.org/jira/browse/CASSANDRA-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13677: --- Description: This would now give something like: {code} WARN [ReadStage-15] 2017-06-08 12:47:57,799 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-15,5,main]: {} java.lang.RuntimeException: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: Command 'Read(biggraphite_metadata.directories columns=* rowfilter=component_0 = criteo limits=LIMIT 5000 range=(min(-9223372036854775808), min(-9223372036854775808)] pfilter=names(EMPTY))' took too long (100 > 100ms). at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2591) ~[main/:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131] at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[main/:na] at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [main/:na] at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [main/:na] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131] Caused by: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: Command 'Read(biggraphite_metadata.directories columns=* rowfilter=component_0 = criteo limits=LIMIT 5000 range=(min(-9223372036854775808), min(-9223372036854775808)] pfilter=names(EMPTY))' took too long (100 > 100ms). at org.apache.cassandra.index.sasi.plan.QueryController.checkpoint(QueryController.java:163) ~[main/:na] at org.apache.cassandra.index.sasi.plan.QueryController.getPartition(QueryController.java:117) ~[main/:na] at org.apache.cassandra.index.sasi.plan.QueryPlan$ResultIterator.computeNext(QueryPlan.java:116) ~[main/:na] at org.apache.cassandra.index.sasi.plan.QueryPlan$ResultIterator.computeNext(QueryPlan.java:71) ~[main/:na] at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[main/:na] at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:92) ~[main/:na] at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:310) ~[main/:na] at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:145) ~[main/:na] at org.apache.cassandra.db.ReadResponse$LocalDataResponse.(ReadResponse.java:138) ~[main/:na] at org.apache.cassandra.db.ReadResponse$LocalDataResponse.(ReadResponse.java:134) ~[main/:na] at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:76) ~[main/:na] at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:333) ~[main/:na] at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1884) ~[main/:na] at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2587) ~[main/:na] ... 5 common frames omitted {code} Not having the query makes it super hard to debug. Even worse, because it stops potentially before the slow_query threshold, it won't appear as one. 
[jira] [Created] (CASSANDRA-13677) Make SASI timeouts easier to debug
Corentin Chary created CASSANDRA-13677: -- Summary: Make SASI timeouts easier to debug Key: CASSANDRA-13677 URL: https://issues.apache.org/jira/browse/CASSANDRA-13677 Project: Cassandra Issue Type: Improvement Components: sasi Reporter: Corentin Chary Assignee: Corentin Chary Priority: Minor Fix For: 4.x This would now give something like: {code} WARN [ReadStage-15] 2017-06-08 12:47:57,799 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-15,5,main]: {} java.lang.RuntimeException: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: Command 'Read(biggraphite_metadata.directories columns=* rowfilter=component_0 = criteo limits=LIMIT 5000 range=(min(-9223372036854775808), min(-9223372036854775808)] pfilter=names(EMPTY))' took too long (100 > 100ms). at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2591) ~[main/:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131] at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[main/:na] at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [main/:na] at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [main/:na] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131] Caused by: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: Command 'Read(biggraphite_metadata.directories columns=* rowfilter=component_0 = criteo limits=LIMIT 5000 range=(min(-9223372036854775808), min(-9223372036854775808)] pfilter=names(EMPTY))' took too long (100 > 100ms). at org.apache.cassandra.index.sasi.plan.QueryController.checkpoint(QueryController.java:163) ~[main/:na] at org.apache.cassandra.index.sasi.plan.QueryController.getPartition(QueryController.java:117) ~[main/:na] at org.apache.cassandra.index.sasi.plan.QueryPlan$ResultIterator.computeNext(QueryPlan.java:116) ~[main/:na] at org.apache.cassandra.index.sasi.plan.QueryPlan$ResultIterator.computeNext(QueryPlan.java:71) ~[main/:na] at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[main/:na] at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:92) ~[main/:na] at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:310) ~[main/:na] at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:145) ~[main/:na] at org.apache.cassandra.db.ReadResponse$LocalDataResponse.(ReadResponse.java:138) ~[main/:na] at org.apache.cassandra.db.ReadResponse$LocalDataResponse.(ReadResponse.java:134) ~[main/:na] at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:76) ~[main/:na] at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:333) ~[main/:na] at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1884) ~[main/:na] at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2587) ~[main/:na] ... 5 common frames omitted {code} Not having the query makes it super hard to debug -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
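A hedged sketch of the change this issue asks for: thread the command into the exception message at the SASI checkpoint. The names below mirror the stack trace above ({{checkpoint}}, the 100ms quota), but the code is illustrative, not the attached patch:
{code}
import java.util.concurrent.TimeUnit;

// Sketch of the idea behind the patch: include the failing command in the
// TimeQuotaExceededException instead of a bare "query took too long".
// The parameter names are illustrative, not QueryController's real fields.
final class TimeQuotaCheck
{
    static void checkpoint(Object command, long executionStartNanos, long executionQuotaMs)
    {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - executionStartNanos);
        if (elapsedMs >= executionQuotaMs)
            throw new RuntimeException( // stands in for TimeQuotaExceededException
                String.format("Command '%s' took too long (%d > %dms).",
                              command, elapsedMs, executionQuotaMs));
    }
}
{code}
Formatting the command lazily, only when the quota is actually exceeded, keeps the happy path free of string-building overhead.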
[jira] [Updated] (CASSANDRA-13677) Make SASI timeouts easier to debug
[ https://issues.apache.org/jira/browse/CASSANDRA-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13677: --- Attachment: 0001-SASI-Make-timeouts-easier-to-debug.patch
> Make SASI timeouts easier to debug
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13677) Make SASI timeouts easier to debug
[ https://issues.apache.org/jira/browse/CASSANDRA-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13677: --- Status: Patch Available (was: Open)
> Make SASI timeouts easier to debug
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073502#comment-16073502 ] Corentin Chary commented on CASSANDRA-13651: I ran tests on a 3-node cluster; I can confirm that not using scheduled tasks and using a simpler batcher removes all the {{epoll_wait(..., 0)}} calls. This reduces the CPU used by epoll threads. I need to take more time to check how efficient the batching still is, and to compare the context switches with and without it.
> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072162#comment-16072162 ] Corentin Chary commented on CASSANDRA-13651: I cooked a patch to use Spotify's netty-batch-flusher instead of the current flusher. Here are some results:
{code}
normal:
Results:
Op rate                   : 4,220 op/s    [insert: 4,220 op/s]
Partition rate            : 4,220 pk/s    [insert: 4,220 pk/s]
Row rate                  : 41,851 row/s  [insert: 41,851 row/s]
Latency mean              : 0.2 ms        [insert: 0.2 ms]
Latency median            : 0.2 ms        [insert: 0.2 ms]
Latency 95th percentile   : 0.2 ms        [insert: 0.2 ms]
Latency 99th percentile   : 0.3 ms        [insert: 0.3 ms]
Latency 99.9th percentile : 0.4 ms        [insert: 0.4 ms]
Latency max               : 65.5 ms       [insert: 65.5 ms]
Total partitions          : 100,000       [insert: 100,000]
Total errors              : 0             [insert: 0]
Total GC count            : 6
Total GC memory           : 3.473 GiB
Total GC time             : 0.4 seconds
Avg GC time               : 60.0 ms
StdDev GC time            : 5.1 ms
Total operation time      : 00:00:23
{code}
{code}
batched:
Results:
Op rate                   : 4,344 op/s    [insert: 4,344 op/s]
Partition rate            : 4,344 pk/s    [insert: 4,344 pk/s]
Row rate                  : 43,121 row/s  [insert: 43,121 row/s]
Latency mean              : 0.2 ms        [insert: 0.2 ms]
Latency median            : 0.2 ms        [insert: 0.2 ms]
Latency 95th percentile   : 0.2 ms        [insert: 0.2 ms]
Latency 99th percentile   : 0.3 ms        [insert: 0.3 ms]
Latency 99.9th percentile : 0.4 ms        [insert: 0.4 ms]
Latency max               : 63.4 ms       [insert: 63.4 ms]
Total partitions          : 100,000       [insert: 100,000]
Total errors              : 0             [insert: 0]
Total GC count            : 6
Total GC memory           : 3.467 GiB
Total GC time             : 0.4 seconds
Avg GC time               : 60.0 ms
StdDev GC time            : 3.3 ms
Total operation time      : 00:00:23
{code}
So slightly more QPS, but more interestingly, the epoll thread now uses ~4 times less CPU. I'll try to do a full-scale benchmark on a bigger workload with 3 nodes tomorrow. Patch at https://github.com/iksaif/cassandra/tree/cassandra-13651-trunk
> Large amount of CPU used by epoll_wait(.., .., .., 0)
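For readers unfamiliar with netty-batch-flusher, the rough idea is to queue responses with {{write()}} and coalesce many of them into a single {{flush()}} scheduled once on the channel's event loop. A sketch of that per-channel coalescing, not Spotify's actual implementation:
{code}
import io.netty.channel.Channel;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of per-channel flush coalescing: the first writer after a flush
// schedules exactly one flush task on the channel's event loop; later
// writers just piggy-back on it.
final class CoalescingFlusher
{
    private final Channel channel;
    private final AtomicBoolean flushScheduled = new AtomicBoolean(false);

    CoalescingFlusher(Channel channel) { this.channel = channel; }

    void write(Object response)
    {
        channel.write(response); // queued in the outbound buffer, no syscall yet
        if (flushScheduled.compareAndSet(false, true))
            channel.eventLoop().execute(() -> {
                flushScheduled.set(false);
                channel.flush(); // one flush covers every write queued so far
            });
    }
}
{code}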
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070997#comment-16070997 ] Corentin Chary commented on CASSANDRA-13651: Also check:
* https://github.com/netty/netty/issues/1759
* https://gist.github.com/jadbaz/47d98da0ead2e71659f343b14ef05de6
* Benchmark batching vs. stupid writeAndFlush()
* It's unclear why sending the response is done in the flusher right now
* https://github.com/spotify/netty-batch-flusher
> Large amount of CPU used by epoll_wait(.., .., .., 0)
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070535#comment-16070535 ] Corentin Chary commented on CASSANDRA-13651: Things to check or try (for me):
* io.netty.eventLoopThreads
* Check if we could use the same event loop instead of starting two
* Create a custom SelectStrategy that skips looking at fds if there is a scheduled task happening in a few microseconds
* Try to understand why Message::Flusher currently works this way
> Large amount of CPU used by epoll_wait(.., .., .., 0)
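The third bullet is straightforward to prototype: Netty 4.1 exposes {{io.netty.channel.SelectStrategy}}, and a strategy can decline the blocking wait whenever tasks are already pending. A sketch; whether skipping the wait is safe and actually faster here is exactly what would need testing:
{code}
import io.netty.channel.SelectStrategy;
import io.netty.util.IntSupplier;

// Sketch: never enter epoll_wait() when tasks are already pending;
// fall through and run them instead of polling fds with a zero timeout.
final class SkipWaitSelectStrategy implements SelectStrategy
{
    @Override
    public int calculateStrategy(IntSupplier selectSupplier, boolean hasTasks) throws Exception
    {
        // CONTINUE tells the event loop to skip the select/epoll_wait step
        // entirely and process its task queue immediately.
        return hasTasks ? SelectStrategy.CONTINUE : SelectStrategy.SELECT;
    }
}
{code}
For comparison, netty's DefaultSelectStrategy does a non-blocking selectNow() when tasks are pending ({{selectSupplier.get()}}), so this variant only differs in skipping that extra syscall.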
[jira] [Updated] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13651: --- Description: I was trying to profile Cassandra under my workload and I kept seeing this backtrace:
{code}
epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
java.lang.Thread.run() Thread.java:745
{code}
At first I thought that the profiler might not be able to profile native code properly, but I went further and realized that most of the CPU was used by {{epoll_wait()}} calls with a timeout of zero.
Here is the output of perf on this system, which confirms that most of the overhead was with timeout == 0.
{code}
Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
Overhead  Trace output
 90.06%   epfd: 0x0047, events: 0x7f5588c0c000, maxevents: 0x2000, timeout: 0x
  5.77%   epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x
  1.98%   epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x03e8
  0.04%   epfd: 0x0003, events: 0x2f6af77b9c00, maxevents: 0x0020, timeout: 0x
  0.04%   epfd: 0x002b, events: 0x121ebf63ac00, maxevents: 0x0040, timeout: 0x
  0.03%   epfd: 0x0026, events: 0x7f51f80019c0, maxevents: 0x0020, timeout: 0x
  0.02%   epfd: 0x0003, events: 0x7fe4d80019d0, maxevents: 0x0020, timeout: 0x
{code}
Running this time with perf record -ag for call traces:
{code}
# Children      Self   sys    usr   Trace output
#
  8.61%  8.61%  0.00%  8.61%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x
        |
        ---0x1000200af313
           |
           --8.61%--0x7fca6117bdac
                    0x7fca60459804
                    epoll_wait
  2.98%  2.98%  0.00%  2.98%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x03e8
        |
        ---0x1000200af313
           0x7fca6117b830
           0x7fca60459804
           epoll_wait
{code}
That looks like a lot of CPU used to wait for nothing. I'm not sure if perf reports a per-CPU or a per-system percentage, but that would still be 10% of the total CPU usage of Cassandra at the minimum.
I went further and found the code responsible for all that: we schedule a lot of {{Message::Flusher}} with a deadline of 10 usec (5 per message, I think) but netty+epoll only support timeouts above one millisecond and will convert everything below to 0.
I added some traces to netty (4.1):
{code}
diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
index 909088fde..8734bbfd4 100644
--- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
+++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
@@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop {
             long currentTimeNanos = System.nanoTime();
             long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
             for (;;)
{code}
[jira] [Created] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
Corentin Chary created CASSANDRA-13651: -- Summary: Large amount of CPU used by epoll_wait(.., .., .., 0) Key: CASSANDRA-13651 URL: https://issues.apache.org/jira/browse/CASSANDRA-13651 Project: Cassandra Issue Type: Bug Reporter: Corentin Chary Fix For: 4.x
I was trying to profile Cassandra under my workload and I kept seeing this backtrace:
{code}
epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
java.lang.Thread.run() Thread.java:745
{code}
At first I thought that the profiler might not be able to profile native code properly, but I went further and realized that most of the CPU was used by epoll_wait() calls with a timeout of zero.
Here is the output of perf on this system, which confirms that most of the overhead was with timeout == 0.
{code}
Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
Overhead  Trace output
 90.06%   epfd: 0x0047, events: 0x7f5588c0c000, maxevents: 0x2000, timeout: 0x
  5.77%   epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x
  1.98%   epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x03e8
  0.04%   epfd: 0x0003, events: 0x2f6af77b9c00, maxevents: 0x0020, timeout: 0x
  0.04%   epfd: 0x002b, events: 0x121ebf63ac00, maxevents: 0x0040, timeout: 0x
  0.03%   epfd: 0x0026, events: 0x7f51f80019c0, maxevents: 0x0020, timeout: 0x
  0.02%   epfd: 0x0003, events: 0x7fe4d80019d0, maxevents: 0x0020, timeout: 0x
{code}
Running this time with perf record -ag for call traces:
{code}
# Children      Self   sys    usr   Trace output
#
  8.61%  8.61%  0.00%  8.61%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x
        |
        ---0x1000200af313
           |
           --8.61%--0x7fca6117bdac
                    0x7fca60459804
                    epoll_wait
  2.98%  2.98%  0.00%  2.98%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x03e8
        |
        ---0x1000200af313
           0x7fca6117b830
           0x7fca60459804
           epoll_wait
{code}
That looks like a lot of CPU used to wait for nothing. I'm not sure if perf reports a per-CPU or a per-system percentage, but that would still be 10% of the total CPU usage of Cassandra at the minimum.
I went further and found the code responsible for all that: we schedule a lot of Message::Flusher with a deadline of 10 usec (5 per message, I think) but netty+epoll only support timeouts above one millisecond and will convert everything below to 0.
I added some traces to netty (4.1):
{code}
diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
index 909088fde..8734bbfd4 100644
--- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
+++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
@@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop
{code}
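The sub-millisecond truncation at the heart of this report is easy to reproduce outside netty; a standalone illustration, not netty code:
{code}
import java.util.concurrent.TimeUnit;

// Illustration of the rounding described above: the event loop turns its
// nanosecond deadline into whole milliseconds before calling epoll_wait(),
// so any deadline under 1ms becomes a 0ms (non-blocking) wait.
public final class TimeoutTruncation
{
    public static void main(String[] args)
    {
        long deadlineNanos = TimeUnit.MICROSECONDS.toNanos(10); // the Flusher's 10 usec
        int epollTimeoutMs = (int) TimeUnit.NANOSECONDS.toMillis(deadlineNanos);
        System.out.println(epollTimeoutMs); // prints 0 -> epoll_wait(.., .., .., 0)
    }
}
{code}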
[jira] [Commented] (CASSANDRA-13647) cassandra-test: URI is not absolute
[ https://issues.apache.org/jira/browse/CASSANDRA-13647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068416#comment-16068416 ] Corentin Chary commented on CASSANDRA-13647: Note: using file:/// works
> cassandra-test: URI is not absolute
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
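That workaround matches how {{java.net.URI#toURL}} behaves; a standalone illustration of the failure and of the two usual ways around it:
{code}
import java.io.File;
import java.net.URI;
import java.net.URL;

public final class UriDemo
{
    public static void main(String[] args) throws Exception
    {
        // What the stress tool effectively does with a bare profile path:
        try {
            URL bad = new URI("./biggraphite.yaml").toURL();
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // "URI is not absolute"
        }

        // Why the workaround works: a file:/// URI is absolute...
        URL ok = new URI("file:///home/user/biggraphite.yaml").toURL();

        // ...and this is the standard way to build one from a local path:
        URL fromFile = new File("./biggraphite.yaml").toURI().toURL();
        System.out.println(ok + " / " + fromFile);
    }
}
{code}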
[jira] [Created] (CASSANDRA-13647) cassandra-test: URI is not absolute
Corentin Chary created CASSANDRA-13647: -- Summary: cassandra-test: URI is not absolute Key: CASSANDRA-13647 URL: https://issues.apache.org/jira/browse/CASSANDRA-13647 Project: Cassandra Issue Type: Bug Components: Tools Reporter: Corentin Chary Fix For: 4.x
With current trunk (I just added the code to print the exception):
{code}
$ ./tools/bin/cassandra-stress user profile=./biggraphite.yaml n=10 'ops(insert=1)' no-warmup cl=ONE
java.lang.IllegalArgumentException: URI is not absolute
        at java.net.URI.toURL(URI.java:1088)
        at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:771)
        at org.apache.cassandra.stress.settings.SettingsCommandUser.<init>(SettingsCommandUser.java:76)
        at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:190)
        at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:220)
        at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:192)
        at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:169)
        at org.apache.cassandra.stress.Stress.run(Stress.java:80)
        at org.apache.cassandra.stress.Stress.main(Stress.java:62)
{code}
I wasn't able to quickly find the change that caused that. cc: [~tjake]
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13444) Fast and garbage-free Streaming Histogram
[ https://issues.apache.org/jira/browse/CASSANDRA-13444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058932#comment-16058932 ] Corentin Chary commented on CASSANDRA-13444: Should we consider this for inclusion in 3.11?
> Fast and garbage-free Streaming Histogram
> -
>
> Key: CASSANDRA-13444
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13444
> Project: Cassandra
> Issue Type: Improvement
> Components: Compaction
> Reporter: Fuud
> Assignee: Fuud
> Fix For: 4.0
>
> Attachments: results.csv, results.xlsx
>
>
> StreamingHistogram is a cause of high CPU usage and GC pressure.
> It was improved in CASSANDRA-13038 by introducing an intermediate buffer to accumulate different values in the big map before merging them into the smaller one.
> But that was not enough for TTLs distributed over a large time range. Rounding (also introduced in 13038) can help, but it reduces histogram precision, especially when TTLs are not distributed uniformly.
> There are several improvements that can help to reduce CPU and GC usage. They are all included in the pull request as separate revisions, so you can test them independently.
> Improvements list:
> # Use Map.computeIfAbsent instead of the get -> check-if-null -> put chain. This way an "add-or-accumulate" operation takes one map operation instead of two. But this method (default-defined in the Map interface) is overridden in HashMap and not in TreeMap, thus I changed the spool type to HashMap.
> # As we round incoming values to _roundSeconds_, we can also round values on merge. This enlarges the hit rate for bin operations.
> # Because we insert only integers into the histogram and round values to integers, we can use *int* everywhere.
> # The histogram spends a huge amount of time merging values. In the merge method, the largest amount of time is taken by finding the nearest points. This can be eliminated by holding an additional TreeSet of differences, sorted from smallest to greatest.
> # Because we know the max size of the _bin_ and _differences_ maps, we can replace them with sorted arrays. Search can be done with _Arrays.binarySearch_ and insertions/deletions with _System.arraycopy_. This also lets some operations be merged into one.
> # Because the spool map is also size-limited, we can replace it with an open-addressing primitive map. This finally reduces the allocation rate to zero.
> You can see the gain given by each step in the attached file. The first number is the time for one benchmark invocation and the second is the allocation rate in Mb per operation.
> Depending on the payload, time is reduced by up to 90%.
> Overall gain:
> ||Payload/SpoolSize||Metric||Original||Optimized||% from original||
> |secondInMonth/0|time ms/op|10747,684|5545,063|51,6|
> |secondInMonth/0|allocation Mb/op|2441,38858|0,002105713|0|
> |secondInMonth/1000|time ms/op|8988,578|5791,179|64,4|
> |secondInMonth/1000|allocation Mb/op|2440,951141|0,017715454|0|
> |secondInMonth/1|time ms/op|10711,671|5765,243|53,8|
> |secondInMonth/1|allocation Mb/op|2437,022537|0,264083862|0|
> |secondInMonth/10|time ms/op|13001,841|5638,069|43,4|
> |secondInMonth/10|allocation Mb/op|2396,947113|2,003662109|0,1|
> |secondInDay/0|time ms/op|10381,833|5497,804|53|
> |secondInDay/0|allocation Mb/op|2441,166107|0,002105713|0|
> |secondInDay/1000|time ms/op|8522,157|5929,871|69,6|
> |secondInDay/1000|allocation Mb/op|1973,112381|0,017715454|0|
> |secondInDay/1|time ms/op|10234,978|5480,077|53,5|
> |secondInDay/1|allocation Mb/op|2306,057404|0,262969971|0|
> |secondInDay/10|time ms/op|2971,178|139,079|4,7|
> |secondInDay/10|allocation Mb/op|172,1276245|2,001721191|1,2|
> |secondIn3Hour/0|time ms/op|10663,123|5605,672|52,6|
> |secondIn3Hour/0|allocation Mb/op|2439,456818|0,002105713|0|
> |secondIn3Hour/1000|time ms/op|9029,788|5838,618|64,7|
> |secondIn3Hour/1000|allocation Mb/op|2331,839249|0,180664063|0|
> |secondIn3Hour/1|time ms/op|4862,409|89,001|1,8|
> |secondIn3Hour/1|allocation Mb/op|965,4871887|0,251711652|0|
> |secondIn3Hour/10|time ms/op|1484,454|95,044|6,4|
> |secondIn3Hour/10|allocation Mb/op|153,2464722|2,001712809|1,3|
> |secondInMin/0|time ms/op|875,118|424,11|48,5|
> |secondInMin/0|allocation Mb/op|610,3554993|0,001776123|0|
> |secondInMin/1000|time ms/op|568,7|84,208|14,8|
> |secondInMin/1000|allocation Mb/op|0,007598114|0,01810023|238,2|
> |secondInMin/1|time ms/op|573,595|83,862|14,6|
> |secondInMin/1|allocation Mb/op|
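To make items 1 and 5 of the list above concrete, here is a small, self-contained sketch (invented names; the actual pull-request revisions may differ) of the single-lookup add-or-accumulate and the sorted-array maintenance via Arrays.binarySearch / System.arraycopy:
{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class HistogramSketch
{
    // Item 1: one map operation instead of get -> check-if-null -> put.
    // A mutable long[1] cell lets computeIfAbsent accumulate in a single
    // lookup; HashMap overrides computeIfAbsent, TreeMap does not.
    static final Map<Integer, long[]> spool = new HashMap<>();

    static void accumulate(int point, long count)
    {
        spool.computeIfAbsent(point, k -> new long[1])[0] += count;
    }

    // Item 5: a fixed-capacity sorted int array instead of a TreeMap.
    static final int[] bins = new int[8]; // only the first `size` slots are used
    static int size = 0;

    static void insertSorted(int point)
    {
        int idx = Arrays.binarySearch(bins, 0, size, point);
        if (idx >= 0)
            return; // already present
        int pos = -idx - 1; // insertion point encoded by binarySearch
        System.arraycopy(bins, pos, bins, pos + 1, size - pos);
        bins[pos] = point;
        size++;
    }

    public static void main(String[] args)
    {
        accumulate(60, 1);
        accumulate(60, 2); // accumulates to 3, one lookup each time
        insertSorted(120);
        insertSorted(60);
        System.out.println(spool.get(60)[0]);                           // 3
        System.out.println(Arrays.toString(Arrays.copyOf(bins, size))); // [60, 120]
    }
}
{code}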
[jira] [Comment Edited] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057392#comment-16057392 ] Corentin Chary edited comment on CASSANDRA-13418 at 6/21/17 12:21 PM: -- Latest version of the patch works as it should: https://github.com/criteo-forks/cassandra/commit/da4a5c17448dab64aeb4295bb7401afbea9edf51 !twcs-cleanup.png! was (Author: iksaif): Latest version of the patch works as it should: https://github.com/criteo-forks/cassandra/commit/da4a5c17448dab64aeb4295bb7401afbea9edf51 !twcs-cleanup.png|thumbnail! > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > Attachments: twcs-cleanup.png > > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057392#comment-16057392 ] Corentin Chary commented on CASSANDRA-13418: Latest version of the patch works as it should: https://github.com/criteo-forks/cassandra/commit/da4a5c17448dab64aeb4295bb7401afbea9edf51 !twcs-cleanup.png|thumbnail! > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > Attachments: twcs-cleanup.png > > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13418: --- Attachment: twcs-cleanup.png > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > Attachments: twcs-cleanup.png > > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044493#comment-16044493 ] Corentin Chary commented on CASSANDRA-13432: We have a case internally where upgrading to 3.0 or changing the data model won't happen, and we know that we *need* this patch for another year. We're currently keeping a forked version, so that's not so much of an issue. I don't believe this patch really changes the behavior, as it simply aborts earlier what would have been aborted anyway later (where "later" may currently be minutes to hours). > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug > Environment: cassandra 2.1.15 >Reporter: Corentin Chary > Fix For: 2.1.x > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 > 0 > RequestResponseStage 0 0 86090931 0 > 0 > ReadRepairStage 0 0 166645 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 47 0 > 0 > GossipStage 0 0 188769 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor0 0 86835 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0 92 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0563 0 > 0 > MemtablePostFlush 0 0 1500 0 > 0 > MemtableReclaimMemory 129534 0 > 0 > Native-Transport-Requests41 0 54819182 0 > 1896 > {code} > {code} > "MemtableReclaimMemory:195" - Thread t@6268 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) > at > org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) > at > org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "SharedPool-Worker-195" - Thread t@989 >java.lang.Thread.State: RUNNABLE > at > org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) >
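For reference, the behavior being relied on here is the existing tombstone_failure_threshold abort in the read path; conceptually (toy code, not the actual org.apache.cassandra.db.filter.QueryFilter implementation) the patch just makes the scan hit this guard earlier instead of after minutes of work:
{code}
// Toy illustration only; names are invented.
public class TombstoneGuard
{
    private final int failureThreshold;
    private int scanned;

    TombstoneGuard(int failureThreshold)
    {
        this.failureThreshold = failureThreshold;
    }

    void onTombstone()
    {
        // Abort the query as soon as the threshold is crossed, rather
        // than letting the scan run to completion and time out later.
        if (++scanned > failureThreshold)
            throw new IllegalStateException("Scanned over " + scanned
                    + " tombstones; query aborted (see tombstone_failure_threshold)");
    }

    public static void main(String[] args)
    {
        TombstoneGuard guard = new TombstoneGuard(2);
        for (int i = 0; i < 5; i++)
            guard.onTombstone(); // throws on the third tombstone
    }
}
{code}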
[jira] [Comment Edited] (CASSANDRA-10765) add RangeIterator interface and QueryPlan for SI
[ https://issues.apache.org/jira/browse/CASSANDRA-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042791#comment-16042791 ] Corentin Chary edited comment on CASSANDRA-10765 at 6/9/17 12:20 PM: - Note: https://github.com/iksaif/cassandra/commit/edbc0a0572b47ef5d5f25d56bd43587eb136170a was an attempt at improving that, which works very well in cases where multiple indexes are queried and some of them intersect but not all of them do. Before/After: !server-load.png|thumbnail! was (Author: iksaif): Note: https://github.com/iksaif/cassandra/commit/edbc0a0572b47ef5d5f25d56bd43587eb136170a was an attempt at improving that, which works very well in cases where multiple indexes are queried and some of them intersect but not all of them do. Before/After: !server-load.jpg|thumbnail! > add RangeIterator interface and QueryPlan for SI > > > Key: CASSANDRA-10765 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10765 > Project: Cassandra > Issue Type: Sub-task > Components: Local Write-Read Paths >Reporter: Pavel Yaskevich >Assignee: Pavel Yaskevich > Fix For: 4.x > > Attachments: server-load.png > > > Currently built-in indexes have only one way of handling > intersections/unions: pick the highest selectivity predicate and filter on > other index expressions. This is not always the most efficient approach. > Dynamic query planning based on the different index characteristics would be > more optimal. Query Plan should be able to choose how to do intersections, > unions based on the metadata provided by indexes (returned by RangeIterator) > and RangeIterator would became a base for cross index interactions and should > have information such as min/max token, estimate number of wrapped tokens etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-10765) add RangeIterator interface and QueryPlan for SI
[ https://issues.apache.org/jira/browse/CASSANDRA-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042791#comment-16042791 ] Corentin Chary edited comment on CASSANDRA-10765 at 6/9/17 12:20 PM: - Note: https://github.com/iksaif/cassandra/commit/edbc0a0572b47ef5d5f25d56bd43587eb136170a was an attempt at improving that, which works very well in cases where multiple indexes are queried and some of them intersect but not all of them do. Before/After: !server-load.png! was (Author: iksaif): Note: https://github.com/iksaif/cassandra/commit/edbc0a0572b47ef5d5f25d56bd43587eb136170a was an attempt at improving that, which works very well in cases where multiple indexes are queried and some of them intersect but not all of them do. Before/After: !server-load.png|thumbnail! > add RangeIterator interface and QueryPlan for SI > > > Key: CASSANDRA-10765 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10765 > Project: Cassandra > Issue Type: Sub-task > Components: Local Write-Read Paths >Reporter: Pavel Yaskevich >Assignee: Pavel Yaskevich > Fix For: 4.x > > Attachments: server-load.png > > > Currently built-in indexes have only one way of handling > intersections/unions: pick the highest selectivity predicate and filter on > other index expressions. This is not always the most efficient approach. > Dynamic query planning based on the different index characteristics would be > more optimal. Query Plan should be able to choose how to do intersections, > unions based on the metadata provided by indexes (returned by RangeIterator) > and RangeIterator would became a base for cross index interactions and should > have information such as min/max token, estimate number of wrapped tokens etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-10765) add RangeIterator interface and QueryPlan for SI
[ https://issues.apache.org/jira/browse/CASSANDRA-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-10765: --- Attachment: server-load.png > add RangeIterator interface and QueryPlan for SI > > > Key: CASSANDRA-10765 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10765 > Project: Cassandra > Issue Type: Sub-task > Components: Local Write-Read Paths >Reporter: Pavel Yaskevich >Assignee: Pavel Yaskevich > Fix For: 4.x > > Attachments: server-load.png > > > Currently built-in indexes have only one way of handling > intersections/unions: pick the highest selectivity predicate and filter on > other index expressions. This is not always the most efficient approach. > Dynamic query planning based on the different index characteristics would be > more optimal. Query Plan should be able to choose how to do intersections, > unions based on the metadata provided by indexes (returned by RangeIterator) > and RangeIterator would became a base for cross index interactions and should > have information such as min/max token, estimate number of wrapped tokens etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-10765) add RangeIterator interface and QueryPlan for SI
[ https://issues.apache.org/jira/browse/CASSANDRA-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042791#comment-16042791 ] Corentin Chary edited comment on CASSANDRA-10765 at 6/9/17 12:20 PM: - Note: https://github.com/iksaif/cassandra/commit/edbc0a0572b47ef5d5f25d56bd43587eb136170a was an attempt at improving that, which works very well in cases where multiple indexes are queried and some of them intersect but not all of them do. Before/After: !server-load.jpg|thumbnail! was (Author: iksaif): Note: https://github.com/iksaif/cassandra/commit/edbc0a0572b47ef5d5f25d56bd43587eb136170a was an attempt at improving that, which works very well in cases where multiple indexes are queried and some of them intersect but not all of them do. > add RangeIterator interface and QueryPlan for SI > > > Key: CASSANDRA-10765 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10765 > Project: Cassandra > Issue Type: Sub-task > Components: Local Write-Read Paths >Reporter: Pavel Yaskevich >Assignee: Pavel Yaskevich > Fix For: 4.x > > Attachments: server-load.png > > > Currently built-in indexes have only one way of handling > intersections/unions: pick the highest selectivity predicate and filter on > other index expressions. This is not always the most efficient approach. > Dynamic query planning based on the different index characteristics would be > more optimal. Query Plan should be able to choose how to do intersections, > unions based on the metadata provided by indexes (returned by RangeIterator) > and RangeIterator would became a base for cross index interactions and should > have information such as min/max token, estimate number of wrapped tokens etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-10765) add RangeIterator interface and QueryPlan for SI
[ https://issues.apache.org/jira/browse/CASSANDRA-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042791#comment-16042791 ] Corentin Chary commented on CASSANDRA-10765: Note: https://github.com/iksaif/cassandra/commit/edbc0a0572b47ef5d5f25d56bd43587eb136170a was an attempt at improving that, which works very well in cases where multiple indexes are queried and some of them intersect but not all of them do. > add RangeIterator interface and QueryPlan for SI > > > Key: CASSANDRA-10765 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10765 > Project: Cassandra > Issue Type: Sub-task > Components: Local Write-Read Paths >Reporter: Pavel Yaskevich >Assignee: Pavel Yaskevich > Fix For: 4.x > > > Currently built-in indexes have only one way of handling > intersections/unions: pick the highest selectivity predicate and filter on > other index expressions. This is not always the most efficient approach. > Dynamic query planning based on the different index characteristics would be > more optimal. Query Plan should be able to choose how to do intersections, > unions based on the metadata provided by indexes (returned by RangeIterator) > and RangeIterator would became a base for cross index interactions and should > have information such as min/max token, estimate number of wrapped tokens etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
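As a generic illustration of why that helps (this is the underlying idea only, not the actual RangeIterator API): intersecting the sorted token streams of several indexes by leapfrogging lets the sparser index drive the cost, instead of materializing the most selective predicate and filtering every candidate against the other expressions:
{code}
import java.util.ArrayList;
import java.util.List;

public class IntersectSketch
{
    // Leapfrog intersection of two sorted token streams: always advance
    // the side that is behind, so few matches mean few comparisons.
    static List<Long> intersect(long[] a, long[] b)
    {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length)
        {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args)
    {
        System.out.println(intersect(new long[]{ 1, 5, 9, 12 },
                                     new long[]{ 5, 6, 12, 20 })); // [5, 12]
    }
}
{code}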
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988490#comment-15988490 ] Corentin Chary commented on CASSANDRA-13418: I agree that fixing CASSANDRA-10496 would be a better solution, but this is likely to take more time, and I'm unsure we will get to the point where it really solves our issues in all cases. I would be inclined to add the option too, with appropriate documentation. > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-10496) Make DTCS/TWCS split partitions based on time during compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988485#comment-15988485 ] Corentin Chary commented on CASSANDRA-10496: Inspecting each timestamp on each cell is surely more correct, but in the first version I'll be looking only at the minTimestamp of the partition (which is fine as long as you have short-lived partitions). With the current writer mechanism I didn't find a way to switch the writer in the middle of a partition anyway. > Make DTCS/TWCS split partitions based on time during compaction > --- > > Key: CASSANDRA-10496 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10496 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: dtcs > Fix For: 3.11.x > > > To avoid getting old data in new time windows with DTCS (or related, like > [TWCS|CASSANDRA-9666]), we need to split out old data into its own sstable > during compaction. > My initial idea is to just create two sstables, when we create the compaction > task we state the start and end times for the window, and any data older than > the window will be put in its own sstable. > By creating a single sstable with old data, we will incrementally get the > windows correct - say we have an sstable with these timestamps: > {{[100, 99, 98, 97, 75, 50, 10]}} > and we are compacting in window {{[100, 80]}} - we would create two sstables: > {{[100, 99, 98, 97]}}, {{[75, 50, 10]}}, and the first window is now > 'correct'. The next compaction would compact in window {{[80, 60]}} and > create sstables {{[75]}}, {{[50, 10]}} etc. > We will probably also want to base the windows on the newest data in the > sstables so that we actually have older data than the window. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-10496) Make DTCS/TWCS split partitions based on time during compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988455#comment-15988455 ] Corentin Chary commented on CASSANDRA-10496: I wanted to give it a shot for TWCS because of CASSANDRA-13418. I was thinking about using a custom CompactionAwareWriter to segregate data by timestamp in the first window (and also make --split-output work). Currently I'm planning to use partition.stats().minTimestamp, but I'm not sure how it is affected by read-repairs. It may be a better idea to group data by deletion time instead. > Make DTCS/TWCS split partitions based on time during compaction > --- > > Key: CASSANDRA-10496 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10496 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: dtcs > Fix For: 3.11.x > > > To avoid getting old data in new time windows with DTCS (or related, like > [TWCS|CASSANDRA-9666]), we need to split out old data into its own sstable > during compaction. > My initial idea is to just create two sstables, when we create the compaction > task we state the start and end times for the window, and any data older than > the window will be put in its own sstable. > By creating a single sstable with old data, we will incrementally get the > windows correct - say we have an sstable with these timestamps: > {{[100, 99, 98, 97, 75, 50, 10]}} > and we are compacting in window {{[100, 80]}} - we would create two sstables: > {{[100, 99, 98, 97]}}, {{[75, 50, 10]}}, and the first window is now > 'correct'. The next compaction would compact in window {{[80, 60]}} and > create sstables {{[75]}}, {{[50, 10]}} etc. > We will probably also want to base the windows on the newest data in the > sstables so that we actually have older data than the window. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
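The split itself is easy to picture with the ticket's own example; a toy version (plain Java, not the actual CompactionAwareWriter machinery) of routing data to the window sstable or the old-data sstable by timestamp:
{code}
import java.util.Arrays;

public class WindowSplit
{
    public static void main(String[] args)
    {
        // Timestamps from the ticket description, compacting in window [100, 80]
        long[] ts = { 100, 99, 98, 97, 75, 50, 10 };
        long windowStart = 80;

        long[] inWindow = Arrays.stream(ts).filter(t -> t >= windowStart).toArray();
        long[] older    = Arrays.stream(ts).filter(t -> t <  windowStart).toArray();

        System.out.println(Arrays.toString(inWindow)); // [100, 99, 98, 97]
        System.out.println(Arrays.toString(older));    // [75, 50, 10]
    }
}
{code}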
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984600#comment-15984600 ] Corentin Chary commented on CASSANDRA-13418: [~krummas]: good point, CASSANDRA-10496 seems to come with its own set of issues: the number of sstables would probably get huge, unless you add some kind of "buffering" like what is done for the first window. I'll see if I can find a reasonable solution for TWCS or propose it in the related ticket. If we can't agree on a good solution, we can fall back to what is proposed here. [~adejanovski]: about skipping getOverlappingSSTables() completely, I thought about that too, but I think it's used in some other places and I wasn't sure what the result would be. > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984354#comment-15984354 ] Corentin Chary commented on CASSANDRA-13418: Trying to go forward, [~jjirsa], [~adejanovski], [~krummas]: what is your opinion on adding a custom option to TWCS and DTCS that only does basically what my current patch does? The only drawback that I see, if a fully expired overlapping table is removed, is that read-repaired data that was explicitly deleted could eventually re-appear. If you're aware of more dangerous situations I'd be glad to hear about them. > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13432: --- Attachment: (was: CASSANDRA-13432.patch) > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug > Environment: cassandra 2.1.15 >Reporter: Corentin Chary > Fix For: 2.1.x > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 > 0 > RequestResponseStage 0 0 86090931 0 > 0 > ReadRepairStage 0 0 166645 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 47 0 > 0 > GossipStage 0 0 188769 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor0 0 86835 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0 92 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0563 0 > 0 > MemtablePostFlush 0 0 1500 0 > 0 > MemtableReclaimMemory 129534 0 > 0 > Native-Transport-Requests41 0 54819182 0 > 1896 > {code} > {code} > "MemtableReclaimMemory:195" - Thread t@6268 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) > at > org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) > at > org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "SharedPool-Worker-195" - Thread t@989 >java.lang.Thread.State: RUNNABLE > at > org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) > at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) > at > org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) > at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) > at > org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:184) > at > org.apache.cassandra.db.filter.QueryFilter$2.hasNext
[jira] [Updated] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13418: --- Agreed for the option. It would be easy to implement using a new one. IMHO it's more dangerous to have nothing, as this would degrade write performance and take up to twice the space originally planned. Compared to that, it isn't really an issue to have re-appearing data after an explicit deletion (I think that's the worst that can happen, but I can be wrong). > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974024#comment-15974024 ] Corentin Chary commented on CASSANDRA-13418: [~rgerard]: No, it should certainly not be the default. If you look at the description of our use case, it's only necessary when you have short-lived data with a lot of cells, which makes running periodic repairs impossible or very impractical, and when you also need/want read-repairs because you can't afford QUORUM reads (datacenters on separate continents and low latency requirements). So there is a need for it, but it should not be the default. [~jjirsa], any opinion? > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972898#comment-15972898 ] Corentin Chary commented on CASSANDRA-13418: Here is an attempt at a patch: https://github.com/iksaif/cassandra/tree/CASSANDRA-13005-trunk Works with:
{code}
ALTER TABLE test.test WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'provide_overlapping_tombstones': 'ignore_overlaps'};
{code}
This outputs:
{code}
WARN [CompactionExecutor:4] 2017-04-18 17:17:00,538 CompactionController.java:96 - You are running with overlapping sstable sanity checks for tombstones disabled on test:test,this can lead to inconsistencies when running explicit deletions.
{code}
I'm still not sure about reusing the existing option, I could be convinced otherwise (but it should not be hard to change). Once we agree on that I can add documentation and unit tests. > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
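Conceptually, what the option changes in the expired-sstable check is just the overlap restriction. A rough model (invented names, not the actual CompactionController.getFullyExpiredSSTables code) under the assumption that a candidate is normally only droppable when its newest data is older than anything live in the overlapping sstables:
{code}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExpiredSketch
{
    static class SSTable
    {
        final String name;
        final long minTimestamp;         // oldest cell in the sstable
        final long maxTimestamp;         // newest cell in the sstable
        final long maxLocalDeletionTime; // when the last cell expires

        SSTable(String name, long minTimestamp, long maxTimestamp, long maxLocalDeletionTime)
        {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
            this.maxLocalDeletionTime = maxLocalDeletionTime;
        }

        public String toString() { return name; }
    }

    static List<SSTable> fullyExpired(List<SSTable> candidates,
                                      List<SSTable> overlapping,
                                      long gcBefore,
                                      boolean ignoreOverlaps)
    {
        // Oldest data still present in overlapping sstables; with
        // 'ignore_overlaps' we simply pretend nothing overlaps.
        long minLiveTimestamp = ignoreOverlaps
                              ? Long.MAX_VALUE
                              : overlapping.stream()
                                           .mapToLong(s -> s.minTimestamp)
                                           .min().orElse(Long.MAX_VALUE);
        return candidates.stream()
                         .filter(s -> s.maxLocalDeletionTime < gcBefore) // everything TTLed out
                         .filter(s -> s.maxTimestamp < minLiveTimestamp) // the check being skipped
                         .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        SSTable expired = new SSTable("expired", 10, 100, 150);
        SSTable blocker = new SSTable("read-repaired-blocker", 50, 90, Long.MAX_VALUE);
        long gcBefore = 200;
        System.out.println(fullyExpired(Arrays.asList(expired), Arrays.asList(blocker), gcBefore, false)); // []
        System.out.println(fullyExpired(Arrays.asList(expired), Arrays.asList(blocker), gcBefore, true));  // [expired]
    }
}
{code}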
[jira] [Commented] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972825#comment-15972825 ] Corentin Chary commented on CASSANDRA-13432: Latest patch https://github.com/iksaif/cassandra/commits/CASSANDRA-13432-2.x > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug > Environment: cassandra 2.1.15 >Reporter: Corentin Chary > Fix For: 2.1.x > > Attachments: CASSANDRA-13432.patch > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 > 0 > RequestResponseStage 0 0 86090931 0 > 0 > ReadRepairStage 0 0 166645 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 47 0 > 0 > GossipStage 0 0 188769 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor0 0 86835 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0 92 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0563 0 > 0 > MemtablePostFlush 0 0 1500 0 > 0 > MemtableReclaimMemory 129534 0 > 0 > Native-Transport-Requests41 0 54819182 0 > 1896 > {code} > {code} > "MemtableReclaimMemory:195" - Thread t@6268 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) > at > org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) > at > org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "SharedPool-Worker-195" - Thread t@989 >java.lang.Thread.State: RUNNABLE > at > org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) > at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) > at > org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) > at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) > at > org.apac
[jira] [Updated] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13432: --- Status: Patch Available (was: Open) > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug > Environment: cassandra 2.1.15 >Reporter: Corentin Chary > Fix For: 2.1.x > > Attachments: CASSANDRA-13432.patch > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 > 0 > RequestResponseStage 0 0 86090931 0 > 0 > ReadRepairStage 0 0 166645 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 47 0 > 0 > GossipStage 0 0 188769 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor0 0 86835 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0 92 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0563 0 > 0 > MemtablePostFlush 0 0 1500 0 > 0 > MemtableReclaimMemory 129534 0 > 0 > Native-Transport-Requests41 0 54819182 0 > 1896 > {code} > {code} > "MemtableReclaimMemory:195" - Thread t@6268 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) > at > org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) > at > org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "SharedPool-Worker-195" - Thread t@989 >java.lang.Thread.State: RUNNABLE > at > org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) > at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) > at > org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) > at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) > at > org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:184) > at > org.apache.ca
[jira] [Updated] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13432: --- Attachment: CASSANDRA-13432.patch > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug > Environment: cassandra 2.1.15 >Reporter: Corentin Chary > Fix For: 2.1.x > > Attachments: CASSANDRA-13432.patch > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 > 0 > RequestResponseStage 0 0 86090931 0 > 0 > ReadRepairStage 0 0 166645 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 47 0 > 0 > GossipStage 0 0 188769 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor0 0 86835 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0 92 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0563 0 > 0 > MemtablePostFlush 0 0 1500 0 > 0 > MemtableReclaimMemory 129534 0 > 0 > Native-Transport-Requests41 0 54819182 0 > 1896 > {code} > {code} > "MemtableReclaimMemory:195" - Thread t@6268 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) > at > org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) > at > org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "SharedPool-Worker-195" - Thread t@989 >java.lang.Thread.State: RUNNABLE > at > org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) > at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) > at > org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) > at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) > at > org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:184) > at > org.apache.cassa
[jira] [Comment Edited] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967256#comment-15967256 ] Corentin Chary edited comment on CASSANDRA-13432 at 4/13/17 7:54 AM: - Tried the patch, setting the tombstone threshold to one:
{code}
ERROR [SharedPool-Worker-4] 2017-04-13 09:51:55,891 QueryFilter.java:201 - Scanned over 1 tombstones in system.size_estimates for key: unknown; query aborted (see tombstone_failure_threshold).
WARN [SharedPool-Worker-4] 2017-04-13 09:51:55,894 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-4,10,main]: {}
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
 at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2249) ~[main/:na]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
 at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[main/:na]
 at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$TraceSessionFutureTask.run(AbstractTracingAwareExecutorService.java:136) [main/:na]
 at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [main/:na]
 at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: org.apache.cassandra.db.filter.TombstoneOverwhelmingException: null
 at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:202) ~[main/:na]
 at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:163) ~[main/:na]
 at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) ~[main/:na]
 at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) ~[main/:na]
 at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) ~[main/:na]
 at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
 at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
 at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:263) ~[main/:na]
 at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:114) ~[main/:na]
 at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:88) ~[main/:na]
 at org.apache.cassandra.db.RowIteratorFactory$2.getReduced(RowIteratorFactory.java:99) ~[main/:na]
 at org.apache.cassandra.db.RowIteratorFactory$2.getReduced(RowIteratorFactory.java:71) ~[main/:na]
 at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:117) ~[main/:na]
 at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100) ~[main/:na]
 at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
 at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
 at org.apache.cassandra.db.ColumnFamilyStore$9.computeNext(ColumnFamilyStore.java:2115) ~[main/:na]
 at org.apache.cassandra.db.ColumnFamilyStore$9.computeNext(ColumnFamilyStore.java:2111) ~[main/:na]
 at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
 at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
 at org.apache.cassandra.db.ColumnFamilyStore.filter(ColumnFamilyStore.java:2266) ~[main/:na]
 at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:2224) ~[main/:na]
 at org.apache.cassandra.db.PagedRangeCommand.executeLocally(PagedRangeCommand.java:115) ~[main/:na]
 at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1572) ~[main/:na]
 at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2246) ~[main/:na]
 ... 5 common frames omitted
{code}
was (Author: iksaif): Tried the patch, setting the tombstone threshold to one:
{code}
WARN [SharedPool-Worker-4] 2017-04-13 09:51:55,894 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-4,10,main]: {}
java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
 at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2249) ~[main/:na]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
[jira] [Commented] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967256#comment-15967256 ] Corentin Chary commented on CASSANDRA-13432: Tried the patch, setting the tombstone threshold to one: {code} WARN [SharedPool-Worker-4] 2017-04-13 09:51:55,894 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-4,10,main]: {} java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2249) ~[main/:na] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121] at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[main/:na] at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$TraceSessionFutureTask.run(AbstractTracingAwareExecutorService.java:136) [main/:na] at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [main/:na] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121] Caused by: org.apache.cassandra.db.filter.TombstoneOverwhelmingException: null at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:202) ~[main/:na] at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:163) ~[main/:na] at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) ~[main/:na] at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) ~[main/:na] at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) ~[main/:na] at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na] at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na] at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:263) ~[main/:na] at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:114) ~[main/:na] at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:88) ~[main/:na] at org.apache.cassandra.db.RowIteratorFactory$2.getReduced(RowIteratorFactory.java:99) ~[main/:na] at org.apache.cassandra.db.RowIteratorFactory$2.getReduced(RowIteratorFactory.java:71) ~[main/:na] at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:117) ~[main/:na] at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100) ~[main/:na] at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na] at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na] at org.apache.cassandra.db.ColumnFamilyStore$9.computeNext(ColumnFamilyStore.java:2115) ~[main/:na] at org.apache.cassandra.db.ColumnFamilyStore$9.computeNext(ColumnFamilyStore.java:2111) ~[main/:na] at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na] at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na] at org.apache.cassandra.db.ColumnFamilyStore.filter(ColumnFamilyStore.java:2266) ~[main/:na] at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:2224) ~[main/:na] at org.apache.cassandra.db.PagedRangeCommand.executeLocally(PagedRangeCommand.java:115) 
~[main/:na] at org.apache.cassandra.service.StorageProxy$LocalRangeSliceRunnable.runMayThrow(StorageProxy.java:1572) ~[main/:na] at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2246) ~[main/:na] ... 5 common frames omitted {code} > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug > Environment: cassandra 2.1.15 >Reporter: Corentin Chary > Fix For: 2.1.x > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 >
[jira] [Commented] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965534#comment-15965534 ] Corentin Chary commented on CASSANDRA-13432: It's 2.1.15 but I don't believe it has been fixed. I believe that it's stuck in org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:156) which doesn't count tombstones. A simple patch could be something like: {code} diff --git a/src/java/org/apache/cassandra/db/filter/QueryFilter.java b/src/java/org/apache/cassandra/db/filter/QueryFilter.java index db531a5..8b718db 100644 --- a/src/java/org/apache/cassandra/db/filter/QueryFilter.java +++ b/src/java/org/apache/cassandra/db/filter/QueryFilter.java @@ -23,6 +23,10 @@ import java.util.Iterator; import java.util.List; import java.util.SortedSet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import org.apache.cassandra.config.DatabaseDescriptor; import org.apache.cassandra.db.Cell; import org.apache.cassandra.db.ColumnFamily; import org.apache.cassandra.db.DecoratedKey; @@ -34,10 +38,12 @@ import org.apache.cassandra.db.columniterator.OnDiskAtomIterator; import org.apache.cassandra.db.composites.CellName; import org.apache.cassandra.db.composites.Composite; import org.apache.cassandra.io.sstable.SSTableReader; +import org.apache.cassandra.tracing.Tracing; import org.apache.cassandra.utils.MergeIterator; public class QueryFilter { +private static final Logger logger = LoggerFactory.getLogger(QueryFilter.class); public final DecoratedKey key; public final String cfName; public final IDiskAtomFilter filter; @@ -147,6 +153,7 @@ public class QueryFilter return new Iterator() { private Cell next; +private int tombstoneCount = 0; public boolean hasNext() { @@ -181,6 +188,19 @@ public class QueryFilter } else { +tombstoneCount++; +if (tombstoneCount > DatabaseDescriptor.getTombstoneFailureThreshold()) +{ +Tracing.trace("Scanned over {} tombstones; query aborted (see tombstone_failure_threshold)", + DatabaseDescriptor.getTombstoneFailureThre
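shold());
{code}
In standalone form, the guard sketched by this patch (illustrative names below, not the actual QueryFilter change): wrap the cell iterator, count tombstones as they are skipped, and abort once a fixed threshold is crossed instead of scanning indefinitely.
{code}
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class TombstoneOverwhelmingException extends RuntimeException {}

class TombstoneGuardedIterator<T> implements Iterator<T>
{
    private final Iterator<T> wrapped;
    private final Predicate<T> isTombstone;
    private final int failureThreshold;
    private int tombstoneCount = 0;
    private T next;

    TombstoneGuardedIterator(Iterator<T> wrapped, Predicate<T> isTombstone, int failureThreshold)
    {
        this.wrapped = wrapped;
        this.isTombstone = isTombstone;
        this.failureThreshold = failureThreshold;
    }

    public boolean hasNext()
    {
        while (next == null && wrapped.hasNext())
        {
            T candidate = wrapped.next();
            if (isTombstone.test(candidate))
            {
                // The guard the patch adds: without it, a read over millions
                // of tombstones spins here with no way to bail out.
                if (++tombstoneCount > failureThreshold)
                    throw new TombstoneOverwhelmingException();
            }
            else
            {
                next = candidate;
            }
        }
        return next != null;
    }

    public T next()
    {
        if (!hasNext())
            throw new NoSuchElementException();
        T result = next;
        next = null;
        return result;
    }
}
{code}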
[jira] [Commented] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964465#comment-15964465 ] Corentin Chary commented on CASSANDRA-13432: I checked, 3.x has different code to count tombstones, so it's likely not affected. > MemtableReclaimMemory can get stuck because of lack of timeout in > getTopLevelColumns() > -- > > Key: CASSANDRA-13432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 > Project: Cassandra > Issue Type: Bug >Reporter: Corentin Chary > Fix For: 2.1.x > > > This might affect 3.x too, I'm not sure. > {code} > $ nodetool tpstats > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 0 32135875 0 > 0 > ReadStage 114 0 29492940 0 > 0 > RequestResponseStage 0 0 86090931 0 > 0 > ReadRepairStage 0 0 166645 0 > 0 > CounterMutationStage 0 0 0 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 47 0 > 0 > GossipStage 0 0 188769 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor0 0 86835 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 0 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0 92 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 0 0563 0 > 0 > MemtablePostFlush 0 0 1500 0 > 0 > MemtableReclaimMemory 129534 0 > 0 > Native-Transport-Requests41 0 54819182 0 > 1896 > {code} > {code} > "MemtableReclaimMemory:195" - Thread t@6268 >java.lang.Thread.State: WAITING > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) > at > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) > at > org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) > at > org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) >Locked ownable synchronizers: > - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "SharedPool-Worker-195" - Thread t@989 >java.lang.Thread.State: RUNNABLE > at > org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) > at > org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) > at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) > at > org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) > at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) > at > org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:184) > at
[jira] [Updated] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
[ https://issues.apache.org/jira/browse/CASSANDRA-13432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13432: --- Description: This might affect 3.x too, I'm not sure. {code} $ nodetool tpstats Pool NameActive Pending Completed Blocked All time blocked MutationStage 0 0 32135875 0 0 ReadStage 114 0 29492940 0 0 RequestResponseStage 0 0 86090931 0 0 ReadRepairStage 0 0 166645 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 HintedHandoff 0 0 47 0 0 GossipStage 0 0 188769 0 0 CacheCleanupExecutor 0 0 0 0 0 InternalResponseStage 0 0 0 0 0 CommitLogArchiver 0 0 0 0 0 CompactionExecutor0 0 86835 0 0 ValidationExecutor0 0 0 0 0 MigrationStage0 0 0 0 0 AntiEntropyStage 0 0 0 0 0 PendingRangeCalculator0 0 92 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0563 0 0 MemtablePostFlush 0 0 1500 0 0 MemtableReclaimMemory 129534 0 0 Native-Transport-Requests41 0 54819182 0 1896 {code} {code} "MemtableReclaimMemory:195" - Thread t@6268 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) at org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) at org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) at org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) "SharedPool-Worker-195" - Thread t@989 java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) at org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) at org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) at org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) at org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:184) at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:156) at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:263) at org.apache.
[jira] [Created] (CASSANDRA-13432) MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns()
Corentin Chary created CASSANDRA-13432: -- Summary: MemtableReclaimMemory can get stuck because of lack of timeout in getTopLevelColumns() Key: CASSANDRA-13432 URL: https://issues.apache.org/jira/browse/CASSANDRA-13432 Project: Cassandra Issue Type: Bug Reporter: Corentin Chary Fix For: 2.1.x This might affect 3.x too, I'm not sure. {code} $ nodetool tpstats Pool NameActive Pending Completed Blocked All time blocked MutationStage 0 0 32135875 0 0 ReadStage 114 0 29492940 0 0 RequestResponseStage 0 0 86090931 0 0 ReadRepairStage 0 0 166645 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 HintedHandoff 0 0 47 0 0 GossipStage 0 0 188769 0 0 CacheCleanupExecutor 0 0 0 0 0 InternalResponseStage 0 0 0 0 0 CommitLogArchiver 0 0 0 0 0 CompactionExecutor0 0 86835 0 0 ValidationExecutor0 0 0 0 0 MigrationStage0 0 0 0 0 AntiEntropyStage 0 0 0 0 0 PendingRangeCalculator0 0 92 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 0 0563 0 0 MemtablePostFlush 0 0 1500 0 0 MemtableReclaimMemory 129534 0 0 Native-Transport-Requests41 0 54819182 0 1896 {code} {code} "MemtableReclaimMemory:195" - Thread t@6268 java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) at org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:283) at org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:417) at org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1151) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - locked <6e7b1160> (a java.util.concurrent.ThreadPoolExecutor$Worker) "SharedPool-Worker-195" - Thread t@989 java.lang.Thread.State: RUNNABLE at org.apache.cassandra.db.RangeTombstoneList.addInternal(RangeTombstoneList.java:690) at org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:650) at org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:171) at org.apache.cassandra.db.RangeTombstoneList.add(RangeTombstoneList.java:143) at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:240) at org.apache.cassandra.db.ArrayBackedSortedColumns.delete(ArrayBackedSortedColumns.java:483) at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:153) at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:184) at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:156) at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) at com.goog
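le.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
{code}
The two stack traces show the mechanism: MemtableReclaimMemory blocks in OpOrder$Barrier.await() until every read that started before the barrier has finished, so one unbounded tombstone scan pins the memtable memory indefinitely. A toy, self-contained analogy of that wait (plain java.util.concurrent, not Cassandra's OpOrder API):
{code}
import java.util.concurrent.CountDownLatch;

public class ReclaimBarrierDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        CountDownLatch inFlightReads = new CountDownLatch(1); // one slow read group

        Thread slowRead = new Thread(() -> {
            try
            {
                Thread.sleep(60_000); // stands in for an endless tombstone scan
            }
            catch (InterruptedException ignored) {}
            inFlightReads.countDown();
        }, "SharedPool-Worker");

        Thread reclaim = new Thread(() -> {
            try
            {
                inFlightReads.await(); // the OpOrder$Barrier.await() in the dump above
                System.out.println("memtable memory reclaimed");
            }
            catch (InterruptedException ignored) {}
        }, "MemtableReclaimMemory");

        slowRead.start();
        reclaim.start();
        reclaim.join(); // blocks for the full minute: reclaim is hostage to the read
    }
}
{code}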
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962212#comment-15962212 ] Corentin Chary commented on CASSANDRA-13418: If I understand things correctly here, the worst that can happen is that data could re-appear. Remember that we only drop SSTables where *all* the items have expired. (The worst that can happen if you don't have the option is that you suddenly stop dropping SSTables and all your disks fill up.) > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
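As a hedged sketch of the mechanics being discussed (simplified stand-ins, not Cassandra's actual TWCS/compaction types): a candidate is "fully expired" when its newest deletion time is in the past, and the stock behaviour additionally refuses to drop it while any other sstable holds older live data that the candidate's tombstones might shadow. The proposed option simply skips that second test.
{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class ExpiredSSTableFilter
{
    static final class SSTable
    {
        final String name;
        final long maxLocalDeletionTime; // newest expiry in the sstable, seconds
        final long minTimestamp;         // oldest data in the sstable, seconds

        SSTable(String name, long maxLocalDeletionTime, long minTimestamp)
        {
            this.name = name;
            this.maxLocalDeletionTime = maxLocalDeletionTime;
            this.minTimestamp = minTimestamp;
        }
    }

    static List<SSTable> fullyExpired(Collection<SSTable> live, long nowSeconds, boolean ignoreOverlaps)
    {
        List<SSTable> expired = new ArrayList<>();
        for (SSTable candidate : live)
        {
            if (candidate.maxLocalDeletionTime >= nowSeconds)
                continue; // something in this sstable is still live

            boolean blocked = false;
            if (!ignoreOverlaps)
            {
                for (SSTable other : live)
                {
                    // An overlapping sstable with older data blocks the drop,
                    // since the candidate's tombstones may still shadow it.
                    if (other != candidate && other.minTimestamp < candidate.maxLocalDeletionTime)
                    {
                        blocked = true;
                        break;
                    }
                }
            }
            if (!blocked)
                expired.add(candidate);
        }
        return expired;
    }
}
{code}
With ignoreOverlaps set, the failure mode is exactly the one described above (shadowed data can briefly re-appear); without it, a single straggler sstable can block every drop.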
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962183#comment-15962183 ] Corentin Chary commented on CASSANDRA-13418: AFAIK provide_overlapping_tombstones is a compaction property that we already have. I'm suggesting to add "ignore" on top of the existing "none" (default), "cell" and "row" values. > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
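Concretely, the suggestion amounts to one more value on that option; the enum below only illustrates the shape, it is not the actual Cassandra type:
{code}
enum TombstoneOption
{
    NONE,   // current default
    CELL,   // collect overlapping tombstones per cell
    ROW,    // collect overlapping tombstones per row
    IGNORE  // proposed: skip overlap checks when dropping fully expired sstables
}
{code}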
[jira] [Updated] (CASSANDRA-13418) Allow TWCS to ignore overlaps when dropping fully expired sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13418: --- Summary: Allow TWCS to ignore overlaps when dropping fully expired sstables (was: Allow TWCS to ignore overlaps) > Allow TWCS to ignore overlaps when dropping fully expired sstables > -- > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-12962) SASI: Index are rebuilt on restart
[ https://issues.apache.org/jira/browse/CASSANDRA-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958936#comment-15958936 ] Corentin Chary edited comment on CASSANDRA-12962 at 4/6/17 1:45 PM: Sure. I looked at the patch again and it looks perfectly fine. was (Author: iksaif): Sure > SASI: Index are rebuilt on restart > -- > > Key: CASSANDRA-12962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12962 > Project: Cassandra > Issue Type: Bug > Components: sasi >Reporter: Corentin Chary >Assignee: Alex Petrov >Priority: Minor > Fix For: 3.11.x > > Attachments: screenshot-1.png > > > Apparently when cassandra any index that does not index a value in *every* > live SSTable gets rebuild. The offending code can be found in the constructor > of SASIIndex. > You can easilly reproduce it: > {code} > CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true; > CREATE TABLE test.test ( > a text PRIMARY KEY, > b text, > c text > ) WITH bloom_filter_fp_chance = 0.01 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', > 'max_threshold': '32', 'min_threshold': '4'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > INSERT INTO test.test (a, b) VALUES ('a', 'b'); > {code} > Log (I added additional traces): > {code} > INFO [main] 2016-11-28 15:32:21,191 ColumnFamilyStore.java:406 - > Initializing test.test > DEBUG [SSTableBatchOpen:1] 2016-11-28 15:32:21,192 SSTableReader.java:505 - > Opening > /mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big > (0.034KiB) > DEBUG [main] 2016-11-28 15:32:21,194 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@2f661b1a[id=6b00489b-7010-396e-9348-9f32f5167f88,name=test_b_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=b}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > INFO [main] 2016-11-28 15:32:21,194 DataTracker.java:152 - > SSTableIndex.open(column: b, minTerm: value, maxTerm: value, minKey: key, > maxKey: key, sstable: BigTableReader(path='/mnt/ssd/tmp/data/data/test/test\ > -229e6380b57711e68407158fde22e121/mc-1-big-Data.db')) > DEBUG [main] 2016-11-28 15:32:21,195 SASIIndex.java:129 - Rebuilding SASI > Indexes: {} > DEBUG [main] 2016-11-28 15:32:21,195 ColumnFamilyStore.java:895 - Enqueuing > flush of IndexInfo: 0.386KiB (0%) on-heap, 0.000KiB (0%) off-heap > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:465 - Writing Memtable-IndexInfo@748981977(0.054KiB serialized > bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223\ > 372036854775808), max(9223372036854775807)] > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:494 - Completed 
flushing > /mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db > (0.035KiB) for\ > commitlog position CommitLogPosition(segmentId=1480343535479, position=15652) > DEBUG [MemtableFlushWriter:1] 2016-11-28 15:32:21,224 > ColumnFamilyStore.java:1200 - Flushed to > [BigTableReader(path='/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db\ > ')] (1 sstables, 4.838KiB), biggest 4.838KiB, smallest 4.838KiB > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@12f3d291[id=45fcb286-b87a-3d18-a04b-b899a9880c91,name=test_c_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=c}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:121 - to rebuild: index: > BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big-Data.db'), > sstable: org.apache.cassa\ > ndra.index.sasi.conf.ColumnIndex@6cb
[jira] [Updated] (CASSANDRA-12962) SASI: Index are rebuilt on restart
[ https://issues.apache.org/jira/browse/CASSANDRA-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-12962: --- Reviewer: Corentin Chary > SASI: Index are rebuilt on restart > -- > > Key: CASSANDRA-12962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12962 > Project: Cassandra > Issue Type: Bug > Components: sasi >Reporter: Corentin Chary >Assignee: Alex Petrov >Priority: Minor > Fix For: 3.11.x > > Attachments: screenshot-1.png > > > Apparently when cassandra any index that does not index a value in *every* > live SSTable gets rebuild. The offending code can be found in the constructor > of SASIIndex. > You can easilly reproduce it: > {code} > CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true; > CREATE TABLE test.test ( > a text PRIMARY KEY, > b text, > c text > ) WITH bloom_filter_fp_chance = 0.01 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', > 'max_threshold': '32', 'min_threshold': '4'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > INSERT INTO test.test (a, b) VALUES ('a', 'b'); > {code} > Log (I added additional traces): > {code} > INFO [main] 2016-11-28 15:32:21,191 ColumnFamilyStore.java:406 - > Initializing test.test > DEBUG [SSTableBatchOpen:1] 2016-11-28 15:32:21,192 SSTableReader.java:505 - > Opening > /mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big > (0.034KiB) > DEBUG [main] 2016-11-28 15:32:21,194 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@2f661b1a[id=6b00489b-7010-396e-9348-9f32f5167f88,name=test_b_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=b}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > INFO [main] 2016-11-28 15:32:21,194 DataTracker.java:152 - > SSTableIndex.open(column: b, minTerm: value, maxTerm: value, minKey: key, > maxKey: key, sstable: BigTableReader(path='/mnt/ssd/tmp/data/data/test/test\ > -229e6380b57711e68407158fde22e121/mc-1-big-Data.db')) > DEBUG [main] 2016-11-28 15:32:21,195 SASIIndex.java:129 - Rebuilding SASI > Indexes: {} > DEBUG [main] 2016-11-28 15:32:21,195 ColumnFamilyStore.java:895 - Enqueuing > flush of IndexInfo: 0.386KiB (0%) on-heap, 0.000KiB (0%) off-heap > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:465 - Writing Memtable-IndexInfo@748981977(0.054KiB serialized > bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223\ > 372036854775808), max(9223372036854775807)] > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:494 - Completed flushing > /mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db > (0.035KiB) for\ > commitlog 
position CommitLogPosition(segmentId=1480343535479, position=15652) > DEBUG [MemtableFlushWriter:1] 2016-11-28 15:32:21,224 > ColumnFamilyStore.java:1200 - Flushed to > [BigTableReader(path='/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db\ > ')] (1 sstables, 4.838KiB), biggest 4.838KiB, smallest 4.838KiB > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@12f3d291[id=45fcb286-b87a-3d18-a04b-b899a9880c91,name=test_c_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=c}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:121 - to rebuild: index: > BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big-Data.db'), > sstable: org.apache.cassa\ > ndra.index.sasi.conf.ColumnIndex@6cbb6b0e > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:129 - Rebuilding SASI > Indexes: > {BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e6
[jira] [Commented] (CASSANDRA-12962) SASI: Index are rebuilt on restart
[ https://issues.apache.org/jira/browse/CASSANDRA-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958936#comment-15958936 ] Corentin Chary commented on CASSANDRA-12962: Sure > SASI: Index are rebuilt on restart > -- > > Key: CASSANDRA-12962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12962 > Project: Cassandra > Issue Type: Bug > Components: sasi >Reporter: Corentin Chary >Assignee: Alex Petrov >Priority: Minor > Fix For: 3.11.x > > Attachments: screenshot-1.png > > > Apparently when cassandra any index that does not index a value in *every* > live SSTable gets rebuild. The offending code can be found in the constructor > of SASIIndex. > You can easilly reproduce it: > {code} > CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true; > CREATE TABLE test.test ( > a text PRIMARY KEY, > b text, > c text > ) WITH bloom_filter_fp_chance = 0.01 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', > 'max_threshold': '32', 'min_threshold': '4'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > INSERT INTO test.test (a, b) VALUES ('a', 'b'); > {code} > Log (I added additional traces): > {code} > INFO [main] 2016-11-28 15:32:21,191 ColumnFamilyStore.java:406 - > Initializing test.test > DEBUG [SSTableBatchOpen:1] 2016-11-28 15:32:21,192 SSTableReader.java:505 - > Opening > /mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big > (0.034KiB) > DEBUG [main] 2016-11-28 15:32:21,194 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@2f661b1a[id=6b00489b-7010-396e-9348-9f32f5167f88,name=test_b_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=b}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > INFO [main] 2016-11-28 15:32:21,194 DataTracker.java:152 - > SSTableIndex.open(column: b, minTerm: value, maxTerm: value, minKey: key, > maxKey: key, sstable: BigTableReader(path='/mnt/ssd/tmp/data/data/test/test\ > -229e6380b57711e68407158fde22e121/mc-1-big-Data.db')) > DEBUG [main] 2016-11-28 15:32:21,195 SASIIndex.java:129 - Rebuilding SASI > Indexes: {} > DEBUG [main] 2016-11-28 15:32:21,195 ColumnFamilyStore.java:895 - Enqueuing > flush of IndexInfo: 0.386KiB (0%) on-heap, 0.000KiB (0%) off-heap > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:465 - Writing Memtable-IndexInfo@748981977(0.054KiB serialized > bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223\ > 372036854775808), max(9223372036854775807)] > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:494 - Completed flushing > /mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db > 
(0.035KiB) for\ > commitlog position CommitLogPosition(segmentId=1480343535479, position=15652) > DEBUG [MemtableFlushWriter:1] 2016-11-28 15:32:21,224 > ColumnFamilyStore.java:1200 - Flushed to > [BigTableReader(path='/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db\ > ')] (1 sstables, 4.838KiB), biggest 4.838KiB, smallest 4.838KiB > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@12f3d291[id=45fcb286-b87a-3d18-a04b-b899a9880c91,name=test_c_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=c}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:121 - to rebuild: index: > BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big-Data.db'), > sstable: org.apache.cassa\ > ndra.index.sasi.conf.ColumnIndex@6cbb6b0e > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:129 - Rebuilding SASI > Indexes: > {BigTableReader(path='/mnt/ssd/tmp/da
[jira] [Commented] (CASSANDRA-13418) Allow TWCS to ignore overlaps
[ https://issues.apache.org/jira/browse/CASSANDRA-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958337#comment-15958337 ] Corentin Chary commented on CASSANDRA-13418: What do you think about provide_overlapping_tombstones = "ignore"? This would integrate nicely with the code and does not add yet another compaction option (though it sounds a little weird). > Allow TWCS to ignore overlaps > - > > Key: CASSANDRA-13418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 > Project: Cassandra > Issue Type: Improvement > Components: Compaction >Reporter: Corentin Chary > Labels: twcs > > http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If > you really want read-repairs you're going to have sstables blocking the > expiration of other fully expired SSTables because they overlap. > You can set unchecked_tombstone_compaction = true or tombstone_threshold to a > very low value and that will purge the blockers of old data that should > already have expired, thus removing the overlaps and allowing the other > SSTables to expire. > The thing is that this is rather CPU intensive and not optimal. If you have > time series, you might not care if all your data doesn't exactly expire at > the right time, or if data re-appears for some time, as long as it gets > deleted as soon as it can. And in this situation I believe it would be really > beneficial to allow users to simply ignore overlapping SSTables when looking > for fully expired ones. > To the question: why would you need read-repairs ? > - Full repairs basically take longer than the TTL of the data on my dataset, > so this isn't really effective. > - Even with a 10% chances of doing a repair, we found out that this would be > enough to greatly reduce entropy of the most used data (and if you have > timeseries, you're likely to have a dashboard doing the same important > queries over and over again). > - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. > I'll try to come up with a patch demonstrating how this would work, try it on > our system and report the effects. > cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (CASSANDRA-13418) Allow TWCS to ignore overlaps
Corentin Chary created CASSANDRA-13418: -- Summary: Allow TWCS to ignore overlaps Key: CASSANDRA-13418 URL: https://issues.apache.org/jira/browse/CASSANDRA-13418 Project: Cassandra Issue Type: Improvement Components: Compaction Reporter: Corentin Chary http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html explains it well. If you really want read-repairs you're going to have sstables blocking the expiration of other fully expired SSTables because they overlap. You can set unchecked_tombstone_compaction = true or tombstone_threshold to a very low value and that will purge the blockers of old data that should already have expired, thus removing the overlaps and allowing the other SSTables to expire. The thing is that this is rather CPU intensive and not optimal. If you have time series, you might not care if all your data doesn't exactly expire at the right time, or if data re-appears for some time, as long as it gets deleted as soon as it can. And in this situation I believe it would be really beneficial to allow users to simply ignore overlapping SSTables when looking for fully expired ones. To the question: why would you need read-repairs? - Full repairs basically take longer than the TTL of the data on my dataset, so this isn't really effective. - Even with a 10% chance of doing a repair, we found out that this would be enough to greatly reduce entropy of the most used data (and if you have timeseries, you're likely to have a dashboard doing the same important queries over and over again). - LOCAL_QUORUM is too expensive (need >3 replicas), QUORUM is too slow. I'll try to come up with a patch demonstrating how this would work, try it on our system and report the effects. cc: [~adejanovski], [~rgerard] as I know you worked on similar issues already. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12962) SASI: Index are rebuilt on restart
[ https://issues.apache.org/jira/browse/CASSANDRA-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943403#comment-15943403 ] Corentin Chary commented on CASSANDRA-12962: Looks robust enough to me :) > SASI: Index are rebuilt on restart > -- > > Key: CASSANDRA-12962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12962 > Project: Cassandra > Issue Type: Bug > Components: sasi >Reporter: Corentin Chary >Assignee: Alex Petrov >Priority: Minor > Fix For: 3.11.x > > Attachments: screenshot-1.png > > > Apparently when cassandra any index that does not index a value in *every* > live SSTable gets rebuild. The offending code can be found in the constructor > of SASIIndex. > You can easilly reproduce it: > {code} > CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true; > CREATE TABLE test.test ( > a text PRIMARY KEY, > b text, > c text > ) WITH bloom_filter_fp_chance = 0.01 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', > 'max_threshold': '32', 'min_threshold': '4'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > INSERT INTO test.test (a, b) VALUES ('a', 'b'); > {code} > Log (I added additional traces): > {code} > INFO [main] 2016-11-28 15:32:21,191 ColumnFamilyStore.java:406 - > Initializing test.test > DEBUG [SSTableBatchOpen:1] 2016-11-28 15:32:21,192 SSTableReader.java:505 - > Opening > /mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big > (0.034KiB) > DEBUG [main] 2016-11-28 15:32:21,194 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@2f661b1a[id=6b00489b-7010-396e-9348-9f32f5167f88,name=test_b_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=b}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > INFO [main] 2016-11-28 15:32:21,194 DataTracker.java:152 - > SSTableIndex.open(column: b, minTerm: value, maxTerm: value, minKey: key, > maxKey: key, sstable: BigTableReader(path='/mnt/ssd/tmp/data/data/test/test\ > -229e6380b57711e68407158fde22e121/mc-1-big-Data.db')) > DEBUG [main] 2016-11-28 15:32:21,195 SASIIndex.java:129 - Rebuilding SASI > Indexes: {} > DEBUG [main] 2016-11-28 15:32:21,195 ColumnFamilyStore.java:895 - Enqueuing > flush of IndexInfo: 0.386KiB (0%) on-heap, 0.000KiB (0%) off-heap > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:465 - Writing Memtable-IndexInfo@748981977(0.054KiB serialized > bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223\ > 372036854775808), max(9223372036854775807)] > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:494 - Completed flushing > 
/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db > (0.035KiB) for\ > commitlog position CommitLogPosition(segmentId=1480343535479, position=15652) > DEBUG [MemtableFlushWriter:1] 2016-11-28 15:32:21,224 > ColumnFamilyStore.java:1200 - Flushed to > [BigTableReader(path='/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db\ > ')] (1 sstables, 4.838KiB), biggest 4.838KiB, smallest 4.838KiB > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@12f3d291[id=45fcb286-b87a-3d18-a04b-b899a9880c91,name=test_c_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=c}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:121 - to rebuild: index: > BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big-Data.db'), > sstable: org.apache.cassa\ > ndra.index.sasi.conf.ColumnIndex@6cbb6b0e > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:129 - Rebuilding SASI > Indexes: > {BigTableRead
[jira] [Commented] (CASSANDRA-12962) SASI: Index are rebuilt on restart
[ https://issues.apache.org/jira/browse/CASSANDRA-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935171#comment-15935171 ] Corentin Chary commented on CASSANDRA-12962: Alex: I do not expect to have time to work on that in the next weeks, so feel free to take it :) > SASI: Index are rebuilt on restart > -- > > Key: CASSANDRA-12962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12962 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Priority: Minor > Fix For: 3.11.x > > > Apparently when cassandra any index that does not index a value in *every* > live SSTable gets rebuild. The offending code can be found in the constructor > of SASIIndex. > You can easilly reproduce it: > {code} > CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true; > CREATE TABLE test.test ( > a text PRIMARY KEY, > b text, > c text > ) WITH bloom_filter_fp_chance = 0.01 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', > 'max_threshold': '32', 'min_threshold': '4'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > INSERT INTO test.test (a, b) VALUES ('a', 'b'); > {code} > Log (I added additional traces): > {code} > INFO [main] 2016-11-28 15:32:21,191 ColumnFamilyStore.java:406 - > Initializing test.test > DEBUG [SSTableBatchOpen:1] 2016-11-28 15:32:21,192 SSTableReader.java:505 - > Opening > /mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big > (0.034KiB) > DEBUG [main] 2016-11-28 15:32:21,194 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@2f661b1a[id=6b00489b-7010-396e-9348-9f32f5167f88,name=test_b_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=b}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > INFO [main] 2016-11-28 15:32:21,194 DataTracker.java:152 - > SSTableIndex.open(column: b, minTerm: value, maxTerm: value, minKey: key, > maxKey: key, sstable: BigTableReader(path='/mnt/ssd/tmp/data/data/test/test\ > -229e6380b57711e68407158fde22e121/mc-1-big-Data.db')) > DEBUG [main] 2016-11-28 15:32:21,195 SASIIndex.java:129 - Rebuilding SASI > Indexes: {} > DEBUG [main] 2016-11-28 15:32:21,195 ColumnFamilyStore.java:895 - Enqueuing > flush of IndexInfo: 0.386KiB (0%) on-heap, 0.000KiB (0%) off-heap > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:465 - Writing Memtable-IndexInfo@748981977(0.054KiB serialized > bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223\ > 372036854775808), max(9223372036854775807)] > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:494 - Completed flushing > 
/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db > (0.035KiB) for\ > commitlog position CommitLogPosition(segmentId=1480343535479, position=15652) > DEBUG [MemtableFlushWriter:1] 2016-11-28 15:32:21,224 > ColumnFamilyStore.java:1200 - Flushed to > [BigTableReader(path='/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db\ > ')] (1 sstables, 4.838KiB), biggest 4.838KiB, smallest 4.838KiB > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@12f3d291[id=45fcb286-b87a-3d18-a04b-b899a9880c91,name=test_c_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=c}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:121 - to rebuild: index: > BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big-Data.db'), > sstable: org.apache.cassa\ > ndra.index.sasi.conf.ColumnIndex@6cbb6b0e > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:129 - Rebuilding SASI > Indexes: > {BigTableRead
[jira] [Commented] (CASSANDRA-12962) SASI: Index are rebuilt on restart
[ https://issues.apache.org/jira/browse/CASSANDRA-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935166#comment-15935166 ] Corentin Chary commented on CASSANDRA-12962: Exactly. In my degenerate case I had 64 columns, all indexed, but most data was sparse. This led to ~2h rebuilds after each restart. > SASI: Index are rebuilt on restart > -- > > Key: CASSANDRA-12962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12962 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Priority: Minor > Fix For: 3.11.x > > > Apparently when cassandra any index that does not index a value in *every* > live SSTable gets rebuild. The offending code can be found in the constructor > of SASIIndex. > You can easilly reproduce it: > {code} > CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true; > CREATE TABLE test.test ( > a text PRIMARY KEY, > b text, > c text > ) WITH bloom_filter_fp_chance = 0.01 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', > 'max_threshold': '32', 'min_threshold': '4'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > INSERT INTO test.test (a, b) VALUES ('a', 'b'); > {code} > Log (I added additional traces): > {code} > INFO [main] 2016-11-28 15:32:21,191 ColumnFamilyStore.java:406 - > Initializing test.test > DEBUG [SSTableBatchOpen:1] 2016-11-28 15:32:21,192 SSTableReader.java:505 - > Opening > /mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big > (0.034KiB) > DEBUG [main] 2016-11-28 15:32:21,194 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@2f661b1a[id=6b00489b-7010-396e-9348-9f32f5167f88,name=test_b_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=b}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > INFO [main] 2016-11-28 15:32:21,194 DataTracker.java:152 - > SSTableIndex.open(column: b, minTerm: value, maxTerm: value, minKey: key, > maxKey: key, sstable: BigTableReader(path='/mnt/ssd/tmp/data/data/test/test\ > -229e6380b57711e68407158fde22e121/mc-1-big-Data.db')) > DEBUG [main] 2016-11-28 15:32:21,195 SASIIndex.java:129 - Rebuilding SASI > Indexes: {} > DEBUG [main] 2016-11-28 15:32:21,195 ColumnFamilyStore.java:895 - Enqueuing > flush of IndexInfo: 0.386KiB (0%) on-heap, 0.000KiB (0%) off-heap > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:465 - Writing Memtable-IndexInfo@748981977(0.054KiB serialized > bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223\ > 372036854775808), max(9223372036854775807)] > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:494 - Completed flushing > 
/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db > (0.035KiB) for\ > commitlog position CommitLogPosition(segmentId=1480343535479, position=15652) > DEBUG [MemtableFlushWriter:1] 2016-11-28 15:32:21,224 > ColumnFamilyStore.java:1200 - Flushed to > [BigTableReader(path='/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db\ > ')] (1 sstables, 4.838KiB), biggest 4.838KiB, smallest 4.838KiB > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@12f3d291[id=45fcb286-b87a-3d18-a04b-b899a9880c91,name=test_c_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=c}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:121 - to rebuild: index: > BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big-Data.db'), > sstable: org.apache.cassa\ > ndra.index.sasi.conf.ColumnIndex@6cbb6b0e > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:129 - Rebuildin
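For readers skimming the thread, the shape of the problem is easy to state in code. The sketch below is only an illustration of the startup decision discussed here, with hypothetical names such as hasOnDiskIndex; it is not the actual SASIIndex constructor. A full rebuild is queued whenever any live SSTable lacks an on-disk index component for the indexed column, which is exactly the situation sparse data produces.
{code}
// Hedged sketch of the startup check described in CASSANDRA-12962, not the
// real SASIIndex constructor; hasOnDiskIndex is a hypothetical stand-in.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class SasiRebuildSketch
{
    // Any live sstable missing the per-column index component forces a rebuild,
    // so a sparse column (no value in some sstables) triggers one on every start.
    static List<String> sstablesToRebuild(List<String> liveSSTables, Predicate<String> hasOnDiskIndex)
    {
        List<String> toRebuild = new ArrayList<>();
        for (String sstable : liveSSTables)
            if (!hasOnDiskIndex.test(sstable))
                toRebuild.add(sstable);
        return toRebuild;
    }
}
{code}
With 64 indexed columns of sparse data, almost every sstable fails this check for some column, which is consistent with the multi-hour rebuilds reported above.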
[jira] [Created] (CASSANDRA-13338) JMX: EstimatedPartitionCount / SnapshotSize are expensive
Corentin Chary created CASSANDRA-13338: -- Summary: JMX: EstimatedPartitionCount / SnapshotSize are expensive Key: CASSANDRA-13338 URL: https://issues.apache.org/jira/browse/CASSANDRA-13338 Project: Cassandra Issue Type: Improvement Components: Observability Reporter: Corentin Chary EstimatedPartitionCount / EstimatedRowCount / SnapshotSize seem particularly expensive. For example, on our system org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize can take as much as half a second. Taken together, this means that exporting stats for all your tables (with metrics-graphite or jmx_exporter) is going to take quite some time. We should certainly try to find the most expensive endpoints and see if there is a way to cache some of the values. cc: [~rgerard] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
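To make the proposed caching concrete, here is a plain-Java sketch of one way to put an expensive gauge behind a TTL. This is not Cassandra code; the caller-supplied loader stands in for the costly work (for SnapshotsSize, walking the snapshot directories). The Dropwizard metrics library that Cassandra's metrics build on ships a CachedGauge with essentially this behavior, which may be the natural hook.
{code}
// Illustration only: serve a cached value until a TTL expires, then recompute.
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

final class TtlCachedGauge
{
    private final LongSupplier loader;   // the expensive computation, e.g. a snapshot directory walk
    private final long ttlNanos;
    private long deadline;
    private long cached;

    TtlCachedGauge(LongSupplier loader, long ttl, TimeUnit unit)
    {
        this.loader = loader;
        this.ttlNanos = unit.toNanos(ttl);
        this.deadline = System.nanoTime(); // already expired, so the first read computes
    }

    synchronized long value()
    {
        long now = System.nanoTime();
        if (now - deadline >= 0)           // stale: pay the cost once per TTL
        {
            cached = loader.getAsLong();
            deadline = now + ttlNanos;
        }
        return cached;                     // fresh reads are effectively free
    }
}
{code}
An exporter scraping every 15 seconds against a 60-second TTL would then hit the expensive path at most once per minute per table instead of on every scrape.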
[jira] [Commented] (CASSANDRA-11380) Client visible backpressure mechanism
[ https://issues.apache.org/jira/browse/CASSANDRA-11380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927952#comment-15927952 ] Corentin Chary commented on CASSANDRA-11380: From my tests I didn't find a way to create a setup where there would be fair backpressure using this (which is an issue when you have a cluster shared by multiple clients/workloads). > Client visible backpressure mechanism > - > > Key: CASSANDRA-11380 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11380 > Project: Cassandra > Issue Type: New Feature > Components: Coordination >Reporter: Wei Deng > > Cassandra currently lacks a sophisticated back pressure mechanism to prevent > clients from ingesting data at too high a throughput. One of the reasons why it > hasn't done so is because of its SEDA (Staged Event Driven Architecture) > design. With SEDA, an overloaded thread pool can drop those droppable > messages (in this case, MutationStage can drop mutation or counter mutation > messages) when they exceed the 2-second timeout. This can save the JVM from > running out of memory and crashing. However, one downside of this kind of > load-shedding based backpressure approach is that an increased number of dropped > mutations will increase the chance of inconsistency among replicas and will > likely require more repair (hints can help to some extent, but they're not > designed to cover all inconsistencies); another downside is that excessive > writes will also introduce much more pressure on compaction (especially LCS), > and backlogged compaction will increase read latency and cause more frequent > GC pauses, and depending on the type of compaction, some backlog can take a > long time to clear up even after the write load is removed. It seems that the > current load-shedding mechanism is not adequate to address a common bulk > loading scenario, where clients are trying to ingest data at the highest > throughput possible. We need a more direct way to tell the client drivers to > slow down. > It appears that HBase suffered a similar situation, as discussed in > HBASE-5162, and they introduced a special exception type to tell the > client to slow down when a certain "overloaded" criterion is met. If we can > leverage a similar mechanism, our dropped mutation event can be used to > trigger such exceptions to push back on the client; at the same time, > backlogged compaction (when the number of pending compactions exceeds a > certain threshold) can also be used for the push back, and this can prevent the > vicious cycle mentioned in > https://issues.apache.org/jira/browse/CASSANDRA-11366?focusedCommentId=15198786&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15198786. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
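For context, the HBASE-5162-style push-back mentioned in the description amounts to a server-side check like the invented sketch below; every name and threshold here is hypothetical, not a Cassandra API. It also illustrates the fairness problem raised in the comment above: the check is node-wide, so once any single workload trips a threshold, every client sharing the cluster gets pushed back.
{code}
// Invented illustration of exception-based backpressure. A coordinator would
// run this before accepting a write and let the driver back off on the exception.
final class OverloadCheck
{
    static final int MAX_PENDING_COMPACTIONS = 100;            // hypothetical threshold
    static final int MAX_DROPPED_MUTATIONS_PER_MINUTE = 1_000; // hypothetical threshold

    static void throwIfOverloaded(int pendingCompactions, int droppedMutationsPerMinute)
    {
        // Node-wide state: no notion of which client caused the overload,
        // hence no fairness across tenants.
        if (pendingCompactions > MAX_PENDING_COMPACTIONS
            || droppedMutationsPerMinute > MAX_DROPPED_MUTATIONS_PER_MINUTE)
            throw new OverloadedException("node overloaded, client should back off");
    }

    static final class OverloadedException extends RuntimeException
    {
        OverloadedException(String message) { super(message); }
    }
}
{code}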
[jira] [Commented] (CASSANDRA-12962) SASI: Index are rebuilt on restart
[ https://issues.apache.org/jira/browse/CASSANDRA-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927949#comment-15927949 ] Corentin Chary commented on CASSANDRA-12962: [~ifesdjeen] any idea why the code was made like that in the first place? > SASI: Index are rebuilt on restart > -- > > Key: CASSANDRA-12962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12962 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Priority: Minor > Fix For: 3.11.x > > > Apparently when Cassandra restarts, any index that does not index a value in *every* > live SSTable gets rebuilt. The offending code can be found in the constructor > of SASIIndex. > You can easily reproduce it: > {code} > CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true; > CREATE TABLE test.test ( > a text PRIMARY KEY, > b text, > c text > ) WITH bloom_filter_fp_chance = 0.01 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', > 'max_threshold': '32', 'min_threshold': '4'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING > 'org.apache.cassandra.index.sasi.SASIIndex'; > INSERT INTO test.test (a, b) VALUES ('a', 'b'); > {code} > Log (I added additional traces): > {code} > INFO [main] 2016-11-28 15:32:21,191 ColumnFamilyStore.java:406 - > Initializing test.test > DEBUG [SSTableBatchOpen:1] 2016-11-28 15:32:21,192 SSTableReader.java:505 - > Opening > /mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big > (0.034KiB) > DEBUG [main] 2016-11-28 15:32:21,194 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@2f661b1a[id=6b00489b-7010-396e-9348-9f32f5167f88,name=test_b_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=b}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > INFO [main] 2016-11-28 15:32:21,194 DataTracker.java:152 - > SSTableIndex.open(column: b, minTerm: value, maxTerm: value, minKey: key, > maxKey: key, sstable: BigTableReader(path='/mnt/ssd/tmp/data/data/test/test\ > -229e6380b57711e68407158fde22e121/mc-1-big-Data.db')) > DEBUG [main] 2016-11-28 15:32:21,195 SASIIndex.java:129 - Rebuilding SASI > Indexes: {} > DEBUG [main] 2016-11-28 15:32:21,195 ColumnFamilyStore.java:895 - Enqueuing > flush of IndexInfo: 0.386KiB (0%) on-heap, 0.000KiB (0%) off-heap > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:465 - Writing Memtable-IndexInfo@748981977(0.054KiB serialized > bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223\ > 372036854775808), max(9223372036854775807)] > DEBUG [PerDiskMemtableFlushWriter_0:1] 2016-11-28 15:32:21,204 > Memtable.java:494 - Completed flushing >
/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db > (0.035KiB) for\ > commitlog position CommitLogPosition(segmentId=1480343535479, position=15652) > DEBUG [MemtableFlushWriter:1] 2016-11-28 15:32:21,224 > ColumnFamilyStore.java:1200 - Flushed to > [BigTableReader(path='/mnt/ssd/tmp/data/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-4256-big-Data.db\ > ')] (1 sstables, 4.838KiB), biggest 4.838KiB, smallest 4.838KiB > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:118 - index: > org.apache.cassandra.schema.IndexMetadata@12f3d291[id=45fcb286-b87a-3d18-a04b-b899a9880c91,name=test_c_idx,kind=CUSTOM,options={class_name=org.a\ > pache.cassandra.index.sasi.SASIIndex, target=c}], base CFS(Keyspace='test', > ColumnFamily='test'), tracker > org.apache.cassandra.db.lifecycle.Tracker@15900b83 > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:121 - to rebuild: index: > BigTableReader(path='/mnt/ssd/tmp/data/data/test/test-229e6380b57711e68407158fde22e121/mc-1-big-Data.db'), > sstable: org.apache.cassa\ > ndra.index.sasi.conf.ColumnIndex@6cbb6b0e > DEBUG [main] 2016-11-28 15:32:21,224 SASIIndex.java:129 - Rebuilding SASI > Indexes: > {BigTableReader(path='/mnt/ssd/tmp/d
[jira] [Commented] (CASSANDRA-13189) Use prompt_toolkit in cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927945#comment-15927945 ] Corentin Chary commented on CASSANDRA-13189: I'll try to add some unit tests and send a more formal patch later this month. But if anybody has time to play with it before then, feel free to! > Use prompt_toolkit in cqlsh > --- > > Key: CASSANDRA-13189 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13189 > Project: Cassandra > Issue Type: New Feature > Components: Tools >Reporter: Corentin Chary >Assignee: Corentin Chary >Priority: Minor > Attachments: cqlsh-prompt-tookit.png > > > prompt_toolkit is an alternative to readline > (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a > lot of software, including the upcoming version of ipython. > I'm working on an initial version that keeps compatibility with readline, > which is available here: > https://github.com/iksaif/cassandra/tree/prompt_toolkit > It's still missing tests and a few things, but I'm opening this for tracking > and feedback. > !cqlsh-prompt-tookit.png|thumbnail! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903086#comment-15903086 ] Corentin Chary commented on CASSANDRA-12915: LGTM. Thanks for cleaning up, this is way better now. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
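The "static threshold" option from the description can be made concrete with a small heuristic like the sketch below. The constants and names are invented for illustration; in practice the counts would come from index metadata, and CASSANDRA-10765 proposes making this decision in a real query planner.
{code}
// Hypothetical heuristic, not Cassandra code: skip the token-tree
// intersection when one clause is tiny relative to the other and the big
// side is large, and instead scan the small index and post-filter.
final class IntersectionHeuristic
{
    static final double MAX_SELECTIVITY_RATIO = 0.01; // invented threshold
    static final long MIN_BIG_SIDE = 10_000;          // invented threshold

    /** true = do the full intersection; false = scan the small index and filter rows. */
    static boolean worthIntersecting(long smallCount, long bigCount)
    {
        if (bigCount < MIN_BIG_SIDE)
            return true; // both sides are cheap, intersecting is fine
        return (double) smallCount / bigCount > MAX_SELECTIVITY_RATIO;
    }
}
{code}
For the 2-vs-300k example above, worthIntersecting(2, 300_000) returns false, which matches the observed speedup from only using index1.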
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901335#comment-15901335 ] Corentin Chary commented on CASSANDRA-12915: Looks good now, it would be nice to see the results of the CI on this version :) > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900977#comment-15900977 ] Corentin Chary commented on CASSANDRA-12915: {code} this(range == null ? null : range.min, range == null ? null : range.max, range == null ? 0 : range.count);{code} I think it would be better not to make the assumption that null range == empty range, mostly because it isn't treated the same way in add(). {code} If either range is empty. Empty range is a subrange of (overlaps with) any range.{code} That's not how intersection usually works: shouldn't the result of intersecting an empty range with anything be an empty range? (Which means that an empty range overlaps with nothing.) > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
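To spell out the semantics being argued for, here is a toy model using plain java.util sets rather than SASI's RangeIterator: the intersection of an empty range with anything is empty, which is exactly the claim that an empty range overlaps with nothing.
{code}
// Toy model of intersection semantics, not SASI code.
import java.util.Set;
import java.util.TreeSet;

final class EmptyIntersection
{
    static Set<Long> intersect(Set<Long> a, Set<Long> b)
    {
        Set<Long> result = new TreeSet<>(a);
        result.retainAll(b); // if a is empty, the result stays empty
        return result;
    }

    public static void main(String[] args)
    {
        Set<Long> empty = new TreeSet<>();
        Set<Long> tokens = new TreeSet<>(Set.of(1L, 2L, 3L));
        System.out.println(intersect(empty, tokens)); // [] -- the empty side wins
    }
}
{code}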
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898148#comment-15898148 ] Corentin Chary commented on CASSANDRA-12915: Could you rephrase the question? I thought I answered everything from [this comment|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15897393&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15897393] but it looks like I didn't. The idea of my approach is that I'm looking for this behavior: {code} builder = RangeIntersectionIterator.builder(strategy); builder.add(new LongIterator(new long[] {})); builder.add(new LongIterator(new long[] {1})); range = builder.build(); Assert.assertEquals(0, range.getCount()); Assert.assertFalse(range.hasNext()); // (optimized through isOverlapping() returning false) {code} In other words, adding an empty iterator to a RangeIntersectionIterator should make it empty, and there is a strong difference between an empty and a null iterator. I believe in your case your empty iterator will just get ignored because you need to remove this check: https://github.com/ifesdjeen/cassandra/blob/78b1ff630536b0f48787ced74a66d702d13637ba/src/java/org/apache/cassandra/index/sasi/utils/RangeIterator.java#L151 > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
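The expected builder behavior from the snippet above can be restated as a self-contained toy; the names here are invented and the real class is SASI's RangeIterator builder. The point is that an empty input must poison the intersection and let build() short-circuit, instead of the empty iterator being silently skipped.
{code}
// Toy builder, illustration only: an empty range makes the whole
// intersection provably empty, so no token-tree walks are needed.
import java.util.ArrayList;
import java.util.List;

final class IntersectionBuilder
{
    private final List<List<Long>> ranges = new ArrayList<>();
    private boolean provablyEmpty = false;

    IntersectionBuilder add(List<Long> tokens)
    {
        if (tokens.isEmpty())
            provablyEmpty = true; // do NOT ignore: empty intersected with anything is empty
        else
            ranges.add(tokens);
        return this;
    }

    List<Long> build()
    {
        if (provablyEmpty || ranges.isEmpty())
            return List.of(); // short-circuit without touching any index
        List<Long> result = new ArrayList<>(ranges.get(0));
        for (int i = 1; i < ranges.size(); i++)
            result.retainAll(ranges.get(i));
        return result;
    }
}
{code}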
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897545#comment-15897545 ] Corentin Chary commented on CASSANDRA-12915: The fact that you didn't change the following line makes me think that your patch doesn't really do what we need: Assert.assertEquals(1L, builder.add(new LongIterator(new long[] {})).rangeCount()); Empty ranges really should not get ignored, and the changes made in https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba#diff-22e58be2cfd42af959cb63c97de7eb3cR246 show that the code does not behave as we would like it to. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897430#comment-15897430 ] Corentin Chary commented on CASSANDRA-12915: * Removing ranges.isEmpty() happens in another function; removing it doesn't change anything, as forEach() will iterate over an empty list. * True for min() and max(). It's this way for the switch() because computing min/max keys with an empty range doesn't make much sense. Anything else? If not, I'll remove the duplicated code in min() and max(). > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-12915: --- Fix Version/s: 4.x Status: Patch Available (was: Open) > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871958#comment-15871958 ] Corentin Chary commented on CASSANDRA-12915: {code} CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true; CREATE TABLE test.test ( r text PRIMARY KEY, a text, b text, c text, data text ); CREATE CUSTOM INDEX test_a_idx ON test.test (a) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'true'}; CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'true'}; CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'true'}; {code} {code} $ cat > generate.py import sys import random def main(args): n = int(args[1]) for i in xrange(n): a = '0' b = i % 10 c = i % (n / 10) + random.randint(0, 10) print ("%d,%s,%d,%d,%d" % (i, a, b, c, i)) if __name__ == '__main__': main(sys.argv) $ python generate.py 200 > test.csv {code} {code} COPY test.test FROM 'test.csv' WITH MAXBATCHSIZE = 100 AND MAXATTEMPTS = 10 AND MAXINSERTERRORS = 99; {code} {code} cqlsh> SELECT * FROM test.test WHERE a = '1' AND c = '38151' LIMIT 1 ALLOW FILTERING; r | a | b | c | data ---+---+---+---+-- (0 rows) Tracing session: fbc23200-f522-11e6-95df-69d39475f5a8 activity | timestamp | source| source_elapsed | client ---++---++--- Execute CQL3 query | 2017-02-17 16:08:48.288000 | 127.0.0.1 | 0 | 127.0.0.1 Parsing SELECT * FROM test.test WHERE a = '1' AND c = '38151' LIMIT 1 ALLOW FILTERING; [Native-Transport-Requests-1] | 2017-02-17 16:08:48.288000 | 127.0.0.1 |268 | 127.0.0.1 Preparing statement [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289000 | 127.0.0.1 |513 | 127.0.0.1 Index mean cardinalities are test_a_idx:-9223372036854775808,test_c_idx:-9223372036854775808. Scanning with test_a_idx. [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289000 | 127.0.0.1 |913 | 127.0.0.1 Computing ranges to query [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289000 | 127.0.0.1 | 1027 | 127.0.0.1 Submitting range requests on 257 ranges with a concurrency of 1 (-3.24259165E16 rows per range expected) [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289001 | 127.0.0.1 | 1319 | 127.0.0.1 Submitted 1 concurrent range requests [Native-Transport-Requests-1] | 2017-02-17 16:08:48.29 | 127.0.0.1 | 2229 | 127.0.0.1 Executing read on test.test using index test_a_idx [ReadStage-3] | 2017-02-17 16:08:48.292000 | 127.0.0.1 | 3494 | 127.0.0.1 Read 0 live and 0 tombstone cells [ReadStage-3] | 2017-02-17 16:08:48.293000 | 127.0.0.1 | 4694 | 127.0.0.1 Request complete | 2017-02-17 16:08:48.292930 | 127.0.0.1 | 4930 | 127.0.0.1 {code} Yay! No more iterating over the useless index. Patch is at https://github.com/iksaif/cassandra/tree/sasi-null-intersect > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement >
[jira] [Updated] (CASSANDRA-13189) Use prompt_toolkit in cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13189: --- Description: prompt_toolkit is an alternative to readline (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a lot of software, including the upcoming version of ipython. I'm working on an initial version that keeps compatibility with readline, which is available here: https://github.com/iksaif/cassandra/tree/prompt_toolkit It's still missing tests and a few things, but I'm opening this for tracking and feedback. !cqlsh-prompt-tookit.png|thumbnail! was: prompt_toolkit is an alternative to readline (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a lot of software, including the upcoming version of ipython. I'm working on an initial version that keeps compatibility with readline, which is available here: https://github.com/iksaif/cassandra/tree/prompt_toolkit It's still missing tests and a few things, but I'm opening this for tracking and feedback. !https://issues.apache.org/jira/secure/attachment/12851335/cqlsh-prompt-tookit.png|thumbnail! > Use prompt_toolkit in cqlsh > --- > > Key: CASSANDRA-13189 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13189 > Project: Cassandra > Issue Type: New Feature > Components: Tools >Reporter: Corentin Chary >Assignee: Corentin Chary >Priority: Minor > Attachments: cqlsh-prompt-tookit.png > > > prompt_toolkit is an alternative to readline > (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a > lot of software, including the upcoming version of ipython. > I'm working on an initial version that keeps compatibility with readline, > which is available here: > https://github.com/iksaif/cassandra/tree/prompt_toolkit > It's still missing tests and a few things, but I'm opening this for tracking > and feedback. > !cqlsh-prompt-tookit.png|thumbnail! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13189) Use prompt_toolkit in cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871933#comment-15871933 ] Corentin Chary commented on CASSANDRA-13189: !cqlsh-prompt-tookit.png|thumbnail! > Use prompt_toolkit in cqlsh > --- > > Key: CASSANDRA-13189 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13189 > Project: Cassandra > Issue Type: New Feature > Components: Tools >Reporter: Corentin Chary >Assignee: Corentin Chary >Priority: Minor > Attachments: cqlsh-prompt-tookit.png > > > prompt_toolkit is an alternative to readline > (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a > lot of software, including the upcoming version of ipython. > I'm working on an initial version that keeps compatibility with readline, > which is available here: > https://github.com/iksaif/cassandra/tree/prompt_toolkit > It's still missing tests and a few things, but I'm opening this for tracking > and feedback. > !cqlsh-prompt-tookit.png|thumbnail! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13189) Use prompt_toolkit in cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13189: --- Description: prompt_toolkit is an alternative to readline (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a lot of software, including the upcoming version of ipython. I'm working on an initial version that keeps compatibility with readline, which is available here: https://github.com/iksaif/cassandra/tree/prompt_toolkit It's still missing tests and a few things, but I'm opening this for tracking and feedback. !https://issues.apache.org/jira/secure/attachment/12851335/cqlsh-prompt-tookit.png|thumbnail! was: prompt_toolkit is an alternative to readline (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a lot of software, including the upcoming version of ipython. I'm working on an initial version that keeps compatibility with readline, which is available here: https://github.com/iksaif/cassandra/tree/prompt_toolkit It's still missing tests and a few things, but I'm opening this for tracking and feedback. !cqlsh-prompt-toolkit.png! > Use prompt_toolkit in cqlsh > --- > > Key: CASSANDRA-13189 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13189 > Project: Cassandra > Issue Type: New Feature > Components: Tools >Reporter: Corentin Chary >Assignee: Corentin Chary >Priority: Minor > Attachments: cqlsh-prompt-tookit.png > > > prompt_toolkit is an alternative to readline > (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a > lot of software, including the upcoming version of ipython. > I'm working on an initial version that keeps compatibility with readline, > which is available here: > https://github.com/iksaif/cassandra/tree/prompt_toolkit > It's still missing tests and a few things, but I'm opening this for tracking > and feedback. > !https://issues.apache.org/jira/secure/attachment/12851335/cqlsh-prompt-tookit.png|thumbnail! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (CASSANDRA-13189) Use prompt_toolkit in cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871933#comment-15871933 ] Corentin Chary edited comment on CASSANDRA-13189 at 2/17/17 2:48 PM: - !cqlsh-prompt-tookit.png! was (Author: iksaif): !cqlsh-prompt-tookit.png|thumbnail! > Use prompt_toolkit in cqlsh > --- > > Key: CASSANDRA-13189 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13189 > Project: Cassandra > Issue Type: New Feature > Components: Tools >Reporter: Corentin Chary >Assignee: Corentin Chary >Priority: Minor > Attachments: cqlsh-prompt-tookit.png > > > prompt_toolkit is an alternative to readline > (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a > lot of software, including the upcoming version of ipython. > I'm working on an initial version that keeps compatibility with readline, > which is available here: > https://github.com/iksaif/cassandra/tree/prompt_toolkit > It's still missing tests and a few things, but I'm opening this for tracking > and feedback. > !cqlsh-prompt-tookit.png|thumbnail! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-13189) Use prompt_toolkit in cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-13189: --- Description: prompt_toolkit is an alternative to readline (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a lot of software, including the upcoming version of ipython. I'm working on an initial version that keeps compatibility with readline, which is available here: https://github.com/iksaif/cassandra/tree/prompt_toolkit It's still missing tests and a few things, but I'm opening this for tracking and feedback. !cqlsh-prompt-toolkit.png! was: prompt_toolkit is an alternative to readline (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a lot of software, including the upcoming version of ipython. I'm working on an initial version that keeps compatibility with readline, which is available here: https://github.com/iksaif/cassandra/tree/prompt_toolkit It's still missing tests and a few things, but I'm opening this for tracking and feedback. > Use prompt_toolkit in cqlsh > --- > > Key: CASSANDRA-13189 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13189 > Project: Cassandra > Issue Type: New Feature > Components: Tools >Reporter: Corentin Chary >Assignee: Corentin Chary >Priority: Minor > Attachments: cqlsh-prompt-tookit.png > > > prompt_toolkit is an alternative to readline > (https://github.com/jonathanslenders/python-prompt-toolkit) and is used in a > lot of software, including the upcoming version of ipython. > I'm working on an initial version that keeps compatibility with readline, > which is available here: > https://github.com/iksaif/cassandra/tree/prompt_toolkit > It's still missing tests and a few things, but I'm opening this for tracking > and feedback. > !cqlsh-prompt-toolkit.png! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corentin Chary updated CASSANDRA-12915: --- Summary: SASI: Index intersection with an empty range really inefficient (was: SASI: Index intersection can be very inefficient) > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x > > > It looks like RangeIntersectionIterator.java can be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > If I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only uses 'index1') the query will run in a few > tenths of a millisecond. > I see multiple solutions for that: > * Add a static threshold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()
[ https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865188#comment-15865188 ] Corentin Chary commented on CASSANDRA-13038: The code and the remaining property look good to me. The code of the benchmark could probably be slightly refactored, but that's not really a big deal. > 33% of compaction time spent in StreamingHistogram.update() > --- > > Key: CASSANDRA-13038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13038 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Corentin Chary >Assignee: Jeff Jirsa > Attachments: compaction-speedup.patch, > compaction-streaminghistrogram.png, profiler-snapshot.nps > > > With the following table, that contains a *lot* of cells: > {code} > CREATE TABLE biggraphite.datapoints_11520p_60s ( > metric uuid, > time_start_ms bigint, > offset smallint, > count int, > value double, > PRIMARY KEY ((metric, time_start_ms), offset) > ) WITH CLUSTERING ORDER BY (offset DESC); > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '6'} > Keyspace : biggraphite > Read Count: 1822 > Read Latency: 1.8870054884742042 ms. > Write Count: 2212271647 > Write Latency: 0.027705127678653473 ms. > Pending Flushes: 0 > Table: datapoints_11520p_60s > SSTable count: 47 > Space used (live): 300417555945 > Space used (total): 303147395017 > Space used by snapshots (total): 0 > Off heap memory used (total): 207453042 > SSTable Compression Ratio: 0.4955200053039823 > Number of keys (estimate): 16343723 > Memtable cell count: 220576 > Memtable data size: 17115128 > Memtable off heap memory used: 0 > Memtable switch count: 2872 > Local read count: 0 > Local read latency: NaN ms > Local write count: 1103167888 > Local write latency: 0.025 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 0 > Bloom filter false ratio: 0.0 > Bloom filter space used: 105118296 > Bloom filter off heap memory used: 106547192 > Index summary off heap memory used: 27730962 > Compression metadata off heap memory used: 73174888 > Compacted partition minimum bytes: 61 > Compacted partition maximum bytes: 51012 > Compacted partition mean bytes: 7899 > Average live cells per slice (last five minutes): NaN > Maximum live cells per slice (last five minutes): 0 > Average tombstones per slice (last five minutes): NaN > Maximum tombstones per slice (last five minutes): 0 > Dropped Mutations: 0 > {code} > It looks like a good chunk of the compaction time is lost in > StreamingHistogram.update() (which is used to store the estimated tombstone > drop times). > This could be caused by a huge number of different deletion times, which would > make the bins huge, but this histogram should be capped to 100 keys. It's > more likely caused by the huge number of cells. > A simple solution could be to only take into account part of the cells; the > fact that this table uses TWCS also gives us an additional hint that sampling > deletion times would be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
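For readers unfamiliar with the data structure, StreamingHistogram follows the capped-bin streaming histogram idea (after Ben-Haim and Tom-Tov), and the toy version below shows why update() is hot: every incoming point either lands in an existing bin or pushes the histogram over capacity and forces a scan for the two nearest bins to merge. This is an illustration only, assuming a simple TreeMap layout; it is neither Cassandra's implementation nor the attached patch.
{code}
// Toy capped-bin streaming histogram, illustration only.
import java.util.TreeMap;

final class TinyStreamingHistogram
{
    private final int maxBins; // Cassandra's tombstone histogram caps this at 100
    private final TreeMap<Double, Long> bins = new TreeMap<>();

    TinyStreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    void update(double point)
    {
        bins.merge(point, 1L, Long::sum); // fast path: the point hits an existing bin
        if (bins.size() <= maxBins)
            return;
        // Slow path: over capacity, so find the two closest bins and replace
        // them with their weighted mean. With billions of cells carrying many
        // distinct deletion times, this scan runs constantly.
        double prev = Double.NaN, left = 0, bestGap = Double.POSITIVE_INFINITY;
        for (double key : bins.keySet())
        {
            if (!Double.isNaN(prev) && key - prev < bestGap)
            {
                bestGap = key - prev;
                left = prev;
            }
            prev = key;
        }
        double right = bins.higherKey(left);
        long leftCount = bins.remove(left), rightCount = bins.remove(right);
        bins.put((left * leftCount + right * rightCount) / (leftCount + rightCount),
                 leftCount + rightCount);
    }
}
{code}
This also makes the sampling idea from the description plausible: feeding update() only a fraction of the deletion times shrinks the slow-path frequency while barely changing the estimated distribution, especially under TWCS where deletion times cluster by window.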
[jira] [Comment Edited] (CASSANDRA-13038) 33% of compaction time spent in StreamingHistogram.update()
[ https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865188#comment-15865188 ] Corentin Chary edited comment on CASSANDRA-13038 at 2/14/17 6:41 AM: - The code and the remaining property look good to me. The code of the benchmark could probably be slightly refactored, but that's not really a big deal. Thanks for doing it! was (Author: iksaif): The code and the remaining property look good to me. The code of the benchmark could probably be slightly refactored, but that's not really a big deal. > 33% of compaction time spent in StreamingHistogram.update() > --- > > Key: CASSANDRA-13038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13038 > Project: Cassandra > Issue Type: Bug > Components: Compaction >Reporter: Corentin Chary >Assignee: Jeff Jirsa > Attachments: compaction-speedup.patch, > compaction-streaminghistrogram.png, profiler-snapshot.nps > > > With the following table, that contains a *lot* of cells: > {code} > CREATE TABLE biggraphite.datapoints_11520p_60s ( > metric uuid, > time_start_ms bigint, > offset smallint, > count int, > value double, > PRIMARY KEY ((metric, time_start_ms), offset) > ) WITH CLUSTERING ORDER BY (offset DESC); > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', > 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', > 'max_threshold': '32', 'min_threshold': '6'} > Keyspace : biggraphite > Read Count: 1822 > Read Latency: 1.8870054884742042 ms. > Write Count: 2212271647 > Write Latency: 0.027705127678653473 ms. > Pending Flushes: 0 > Table: datapoints_11520p_60s > SSTable count: 47 > Space used (live): 300417555945 > Space used (total): 303147395017 > Space used by snapshots (total): 0 > Off heap memory used (total): 207453042 > SSTable Compression Ratio: 0.4955200053039823 > Number of keys (estimate): 16343723 > Memtable cell count: 220576 > Memtable data size: 17115128 > Memtable off heap memory used: 0 > Memtable switch count: 2872 > Local read count: 0 > Local read latency: NaN ms > Local write count: 1103167888 > Local write latency: 0.025 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 0 > Bloom filter false ratio: 0.0 > Bloom filter space used: 105118296 > Bloom filter off heap memory used: 106547192 > Index summary off heap memory used: 27730962 > Compression metadata off heap memory used: 73174888 > Compacted partition minimum bytes: 61 > Compacted partition maximum bytes: 51012 > Compacted partition mean bytes: 7899 > Average live cells per slice (last five minutes): NaN > Maximum live cells per slice (last five minutes): 0 > Average tombstones per slice (last five minutes): NaN > Maximum tombstones per slice (last five minutes): 0 > Dropped Mutations: 0 > {code} > It looks like a good chunk of the compaction time is lost in > StreamingHistogram.update() (which is used to store the estimated tombstone > drop times). > This could be caused by a huge number of different deletion times, which would > make the bins huge, but this histogram should be capped to 100 keys. It's > more likely caused by the huge number of cells. > A simple solution could be to only take into account part of the cells; the > fact that this table uses TWCS also gives us an additional hint that sampling > deletion times would be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)