[jira] [Commented] (CASSANDRA-6694) Slightly More Off-Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970487#comment-13970487 ] Pavel Yaskevich commented on CASSANDRA-6694: Also it seems like for some of the methods, e.g. updateDigest, delta, dataSize, diff, reconcile, hashCode etc., it would be much better to have native implementations which work with the underlying bytes directly from day one. Some of them, for example, use value().remaining(), value().compareTo(), value().duplicate(), or name.toByteBuffer() to convert data from one representation to another for no real reason, so we can actually end up generating a lot more temporary objects than we anticipate. There is another concern related to the value() method, which converts the pointer to a DirectBuffer: the problem is that (at least in OpenJDK, and I think Oracle did the same) initialization of that class is synchronized and creates a PhantomReference, which with most collectors will only be purged by a Full GC. > Slightly More Off-Heap Memtables > > > Key: CASSANDRA-6694 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6694 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Labels: performance > Fix For: 2.1 beta2 > > > The Off Heap memtables introduced in CASSANDRA-6689 don't go far enough, as > the on-heap overhead is still very large. It should not be tremendously > difficult to extend these changes so that we allocate entire Cells off-heap, > instead of multiple BBs per Cell (with all their associated overhead). > The goal (if possible) is to reach an overhead of 16-bytes per Cell (plus 4-6 > bytes per cell on average for the btree overhead, for a total overhead of > around 20-22 bytes). 
This translates to 8-byte object overhead, 4-byte > address (we will do alignment tricks like the VM to allow us to address a > reasonably large memory space, although this trick is unlikely to last us > forever, at which point we will have to bite the bullet and accept a 24-byte > per cell overhead), and 4-byte object reference for maintaining our internal > list of allocations, which is unfortunately necessary since we cannot safely > (and cheaply) walk the object graph we allocate otherwise, which is necessary > for (allocation-) compaction and pointer rewriting. > The ugliest thing here is going to be implementing the various CellName > instances so that they may be backed by native memory OR heap memory. -- This message was sent by Atlassian JIRA (v6.2#6252)
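Pavel's point about working on the underlying bytes directly can be sketched as follows. This is an illustration only: the (address, length) layout and every name here are assumptions, not Cassandra's actual NativeCell API. The idea is to feed the digest from a reusable scratch array copied straight from native memory, so no DirectByteBuffer (with its synchronized initialization and PhantomReference) is ever materialized.

```java
import java.lang.reflect.Field;
import java.security.MessageDigest;
import sun.misc.Unsafe;

// Hypothetical sketch: update a digest straight from a raw native address
// instead of wrapping the memory in a DirectByteBuffer per call.
public class NativeDigest
{
    private static final Unsafe unsafe = loadUnsafe();

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }

    // Feed the digest from a small reusable scratch array, copying directly
    // from native memory, so no temporary ByteBuffer is created.
    public static void updateDigest(MessageDigest digest, long address, int length)
    {
        byte[] scratch = new byte[Math.min(Math.max(length, 1), 4096)];
        int offset = 0;
        while (offset < length)
        {
            int n = Math.min(scratch.length, length - offset);
            for (int i = 0; i < n; i++)
                scratch[i] = unsafe.getByte(address + offset + i);
            digest.update(scratch, 0, n);
            offset += n;
        }
    }

    // Demo helper: copy a heap array off-heap, digest it natively, free it.
    public static byte[] digestNative(byte[] data, String algo) throws Exception
    {
        long addr = unsafe.allocateMemory(Math.max(data.length, 1));
        try
        {
            for (int i = 0; i < data.length; i++)
                unsafe.putByte(addr + i, data[i]);
            MessageDigest d = MessageDigest.getInstance(algo);
            updateDigest(d, addr, data.length);
            return d.digest();
        }
        finally
        {
            unsafe.freeMemory(addr);
        }
    }
}
```

The same copy-through-scratch pattern would apply to dataSize, diff, and the comparison methods, with the scratch array avoiding per-call allocation entirely.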
[jira] [Commented] (CASSANDRA-6694) Slightly More Off-Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970462#comment-13970462 ] Pavel Yaskevich commented on CASSANDRA-6694: [~benedict] While working on avoiding usage of the Impl classes and looking closer at the code, I have a question which, knowing that the future is going to be totally off-heap, makes sense to ask now: the current Native*Cell classes re-use Impl code from static implementations of interfaces, but some of the methods, e.g. reconcile for Counter(Update)Cell, in certain conditions need to generate a new object (for now we are allocating a BufferCounterCell, which allows us to use CounterCell.Impl.reconcile for both implementations). Do you have an action plan for the changes required in that regard for the next step in this series, when we are not going to copy things back to heap? > Slightly More Off-Heap Memtables > > > Key: CASSANDRA-6694 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6694 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Labels: performance > Fix For: 2.1 beta2 > > > (Issue description quoted in full in the previous CASSANDRA-6694 message.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7042) Disk space growth until restart
[ https://issues.apache.org/jira/browse/CASSANDRA-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Aller updated CASSANDRA-7042: -- Attachment: after.log before.log > Disk space growth until restart > --- > > Key: CASSANDRA-7042 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7042 > Project: Cassandra > Issue Type: Bug > Environment: Ubuntu 12.04 > Sun Java 7 > Cassandra 2.0.6 >Reporter: Zach Aller >Priority: Critical > Attachments: after.log, before.log > > > Cassandra will constantly eat disk space; we're not sure what's causing it. The only > thing that seems to fix it is a restart of Cassandra. This happens about every > 3-5 hrs: we will grow from about 350GB to 650GB with no end in sight. Once we > restart Cassandra it usually all clears itself up and disks return to normal > for a while, then something triggers it and it starts climbing again. Sometimes > when we restart, compactions pending skyrocket, and if we restart a second time > the compactions pending drop off back to a normal level. One other thing to > note is that the space is not freed until Cassandra starts back up, not when it is > shut down. > I will get a clean log of before and after restarting next time it happens > and post it. 
> Here is a common ERROR in our logs that might be related > ERROR [CompactionExecutor:46] 2014-04-15 09:12:51,040 CassandraDaemon.java > (line 196) Exception in thread Thread[CompactionExecutor:46,1,main] > java.lang.RuntimeException: java.io.FileNotFoundException: > /local-project/cassandra_data/data/wxgrid/grid/wxgrid-grid-jb-468677-Data.db > (No such file or directory) > at > org.apache.cassandra.io.util.ThrottledReader.open(ThrottledReader.java:53) > at > org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1355) > at > org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67) > at > org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1161) > at > org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1173) > at > org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getScanners(LeveledCompactionStrategy.java:194) > at > org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:258) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:126) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:197) > at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) > at java.util.concurrent.FutureTask.run(Unknown Source) > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > at java.lang.Thread.run(Unknown Source) > Caused by: java.io.FileNotFoundException: > 
/local-project/cassandra_data/data/wxgrid/grid/wxgrid-grid-jb-468677-Data.db > (No such file or directory) > at java.io.RandomAccessFile.open(Native Method) > at java.io.RandomAccessFile.<init>(Unknown Source) > at > org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58) > at > org.apache.cassandra.io.util.ThrottledReader.<init>(ThrottledReader.java:35) > at > org.apache.cassandra.io.util.ThrottledReader.open(ThrottledReader.java:49) > ... 17 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7030) Remove JEMallocAllocator
[ https://issues.apache.org/jira/browse/CASSANDRA-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7030: Attachment: benchmark.21.diff.txt bq. As mentioned earlier i don't mind removing it either Well, if it demonstrates an advantage I'd prefer to keep it still :-) Could you try running my benchmark, so we can compare the more specific stats, and can rule out interference by CLHM? I'm particularly surprised that it is anything like as fast, let alone faster, given how dramatically slower it is on my box (36MB/s is laughable). It's possible I have an older version of jemalloc bundled with Ubuntu (I cannot run multi-threaded, but I think this is down to compile options), but I assume the only explanation for such awful performance is JNA. I've attached a diff that should apply to 2.1. > Remove JEMallocAllocator > > > Key: CASSANDRA-7030 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7030 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1 beta2 > > Attachments: 7030.txt, benchmark.21.diff.txt > > > JEMalloc, whilst having some nice performance properties by comparison to > Doug Lea's standard malloc algorithm in principle, is pointless in practice > because of the JNA cost. In general it is around 30x more expensive to call > than unsafe.allocate(); malloc does not have a variability of response time > as extreme as the JNA overhead, so using JEMalloc in Cassandra is never a > sensible idea. I doubt if custom JNI would make it worthwhile either. > I propose removing it. -- This message was sent by Atlassian JIRA (v6.2#6252)
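For anyone wanting to reproduce the Unsafe side of this comparison, a minimal timing sketch follows. It is not the attached benchmark.21.diff.txt, and the jemalloc/JNA path is deliberately omitted so the sketch compiles without third-party jars; a JNA-backed `malloc` would be timed the same way for the comparison the ticket describes.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Minimal stand-in benchmark for the Unsafe.allocateMemory path discussed above.
public class AllocBench
{
    private static final Unsafe unsafe = loadUnsafe();

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }

    // Total nanoseconds for `iterations` allocate/free pairs of `size` bytes.
    public static long benchUnsafe(int iterations, int size)
    {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
        {
            long addr = unsafe.allocateMemory(size);
            unsafe.freeMemory(addr);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args)
    {
        benchUnsafe(100_000, 256); // warm up the JIT first
        long total = benchUnsafe(1_000_000, 256);
        System.out.printf("unsafe alloc+free: %.1f ns/op%n", total / 1_000_000.0);
    }
}
```

Per the ticket's estimate, a JNA call on the same loop would be expected to come out roughly an order of magnitude (or more) slower per operation.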
[jira] [Commented] (CASSANDRA-7030) Remove JEMallocAllocator
[ https://issues.apache.org/jira/browse/CASSANDRA-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970297#comment-13970297 ] Vijay commented on CASSANDRA-7030: -- You are right, I had the synchronization in the test attached to the old ticket because initially we had some segfaults, which were fixed in later JEMalloc releases; the synchronization was never committed into the Cassandra repo because by then it was fixed. Rerunning the test after removing the locks in the same old test classes, the time taken is much better with jemalloc; you might need more runs. The memory footprint is better too (malloc is slower and uses more memory comparatively, as per my tests). http://pastebin.com/JtixVvGU As mentioned earlier I don't mind removing it either :) > Remove JEMallocAllocator > > > Key: CASSANDRA-7030 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7030 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1 beta2 > > Attachments: 7030.txt > > > (Issue description quoted in full in the previous CASSANDRA-7030 message.) -- This message was sent by Atlassian JIRA (v6.2#6252)
git commit: Allow cassandra to compile under java 8
Repository: cassandra Updated Branches: refs/heads/trunk 2804ce994 -> 4d0691759 Allow cassandra to compile under java 8 patch by dbrosius reviewed by jmckenzie for cassandra-7028 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/4d069175 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/4d069175 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/4d069175 Branch: refs/heads/trunk Commit: 4d0691759a19f1faafe889d765145ae6a5096397 Parents: 2804ce9 Author: Dave Brosius Authored: Tue Apr 15 20:36:16 2014 -0400 Committer: Dave Brosius Committed: Tue Apr 15 20:38:32 2014 -0400 -- CHANGES.txt | 1 + build.xml| 11 --- lib/antlr-3.2.jar| Bin 1928009 -> 0 bytes lib/antlr-runtime-3.5.2.jar | Bin 0 -> 167761 bytes lib/licenses/antlr-3.2.txt | 27 -- lib/licenses/antlr-runtime-3.5.2.txt | 27 ++ lib/licenses/stringtemplate-4.0.2.txt| 27 ++ lib/stringtemplate-4.0.2.jar | Bin 0 -> 226406 bytes src/java/org/apache/cassandra/cql3/Cql.g | 22 - 9 files changed, 80 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/4d069175/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index cbf82de..2fbf3ae 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -4,6 +4,7 @@ * Remove CQL2 (CASSANDRA-5918) * Add Thrift get_multi_slice call (CASSANDRA-6757) * Optimize fetching multiple cells by name (CASSANDRA-6933) + * Allow compilation in java 8 (CASSANDRA-7208) 2.1.0-beta2 http://git-wip-us.apache.org/repos/asf/cassandra/blob/4d069175/build.xml -- diff --git a/build.xml b/build.xml index 8c4cb7b..9326424 100644 --- a/build.xml +++ b/build.xml @@ -190,7 +190,7 @@ Building Grammar ${build.src.java}/org/apache/cassandra/cli/Cli.g @@ -211,7 +211,7 @@ Building Grammar ${build.src.java}/org/apache/cassandra/cql3/Cql.g ... 
@@ -330,7 +330,9 @@ [The build.xml hunk bodies are unrecoverable here: the XML element content was stripped when this message was archived. The remaining hunks delete lib/antlr-3.2.jar and its license, add lib/antlr-runtime-3.5.2.jar, lib/stringtemplate-4.0.2.jar and their licenses (see the diffstat above), and update Cql.g; the quoted ANTLR 3.2 BSD license text is truncated in the archive.]
[jira] [Commented] (CASSANDRA-6572) Workload recording / playback
[ https://issues.apache.org/jira/browse/CASSANDRA-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970268#comment-13970268 ] Aleksey Yeschenko commented on CASSANDRA-6572: -- I'd say 3.0, with 2.1 being so close, and delayed. > Workload recording / playback > - > > Key: CASSANDRA-6572 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6572 > Project: Cassandra > Issue Type: New Feature > Components: Core, Tools >Reporter: Jonathan Ellis >Assignee: Lyuben Todorov > Fix For: 2.0.8 > > Attachments: 6572-trunk.diff > > > "Write sample mode" gets us part way to testing new versions against a real > world workload, but we need an easy way to test the query side as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6602: Summary: Compaction improvements to optimize time series data (was: Enhancements to optimize for the storing of time series data) > Compaction improvements to optimize time series data > > > Key: CASSANDRA-6602 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 > Project: Cassandra > Issue Type: New Feature > Components: Core >Reporter: Tupshin Harper > Labels: performance > Fix For: 3.0 > > > There are some unique characteristics of many/most time series use cases that > both present challenges and provide unique opportunities for > optimizations. > One of the major challenges is in compaction. The existing compaction > strategies will tend to re-compact data on disk at least a few times over the > lifespan of each data point, greatly increasing the CPU and IO costs of that > write. > Compaction exists to > 1) ensure that there aren't too many files on disk > 2) ensure that data that should be contiguous (part of the same partition) is > laid out contiguously > 3) delete data due to TTLs or tombstones > The special characteristics of time series data allow us to optimize away all > three. > Time series data > 1) tends to be delivered in time order, with relatively constrained exceptions > 2) often has a pre-determined and fixed expiration date > 3) never gets deleted prior to TTL > 4) has relatively predictable ingestion rates > Note that I filed CASSANDRA-5561, and this ticket potentially replaces or > lowers the need for it. In that ticket, jbellis reasonably asks how that > compaction strategy is better than disabling compaction. > Taking that to heart, here is a compaction-strategy-less approach that could > be extremely efficient for time-series use cases that follow the above > pattern. 
> (For context, I'm thinking of an example use case involving lots of streams > of time-series data with a 5GB per day ingestion rate, and a 1000 day > retention with TTL, resulting in an eventual steady state of 5TB per node) > 1) You have an extremely large memtable (preferably off heap, if/when doable) > for the table, and that memtable is sized to be able to hold a lengthy window > of time. A typical period might be one day. At the end of that period, you > flush the contents of the memtable to an sstable and move to the next one. > This is basically identical to current behaviour, but with thresholds > adjusted so that you can ensure flushing at predictable intervals. (Open > question is whether predictable intervals is actually necessary, or whether > just waiting until the huge memtable is nearly full is sufficient) > 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be > efficiently dropped once all of the columns have expired. (Another side note, it > might be valuable to have a modified version of CASSANDRA-3974 that doesn't > bother storing per-column TTL since it is required that all columns have the > same TTL) > 3) Be able to mark column families as read/write only (no explicit deletes), > so no tombstones. > 4) Optionally add back an additional type of delete that would delete all > data earlier than a particular timestamp, resulting in immediate dropping of > obsoleted sstables. > The result is that for in-order delivered data, every cell will be laid out > optimally on disk on the first pass, and over the course of 1000 days and 5TB > of data, there will "only" be 1000 5GB sstables, so the number of filehandles > will be reasonable. > For exceptions (out-of-order delivery), most cases will be caught by the > extended (24 hour+) memtable flush times and merged correctly automatically. 
> For those that were slightly askew at flush time, or were delivered so far > out of order that they go in the wrong sstable, there is relatively low > overhead to reading from two sstables for a time slice, instead of one, and > that overhead would be incurred relatively rarely unless out-of-order > delivery was the common case, in which case, this strategy should not be used. > Another possible optimization to address out-of-order delivery would be to maintain > more than one time-centric memtable in memory at a time (e.g. two 12 hour > ones), and then you always insert into whichever one of the two "owns" the > appropriate range of time. By delaying flushing the one that is ahead until we are > ready to roll writes over to a third one, we are able to avoid any > fragmentation as long as all deliveries come in no more than 12 hours late > (in this example, presumably tunable). > Anything that triggers compactions will have to be looked at, since there > won't be any. The one concern I have is the ramification of
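The two-memtable routing described above can be sketched as follows. The class, the bucket encoding (0 = current window, 1 = next window, -1 = out of window) and the fixed 12-hour width are illustrative assumptions, not code from this ticket.

```java
import java.util.concurrent.TimeUnit;

// Sketch of routing a write to whichever of two 12-hour windows owns its timestamp.
public class WindowedMemtables
{
    static final long WINDOW_MICROS = TimeUnit.HOURS.toMicros(12);

    // Decide which of the two in-memory windows owns a write timestamp.
    // Writes older than the current window, or beyond the next one, would
    // have to spill into the "wrong" sstable, as the ticket describes.
    public static int windowFor(long writeMicros, long currentWindowStartMicros)
    {
        if (writeMicros < currentWindowStartMicros)
            return -1;
        long delta = writeMicros - currentWindowStartMicros;
        if (delta < WINDOW_MICROS)
            return 0;
        return delta < 2 * WINDOW_MICROS ? 1 : -1;
    }
}
```

Rolling a window forward would flush the memtable behind the current one and advance currentWindowStartMicros by one window width, keeping at most two windows live at a time.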
[jira] [Updated] (CASSANDRA-6066) LHF 2i performance improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6066: Labels: performance (was: ) > LHF 2i performance improvements > --- > > Key: CASSANDRA-6066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6066 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Lyuben Todorov >Priority: Minor > Labels: performance > Fix For: 2.0.8 > > > We should perform more aggressive paging over the index partition (costs us > nothing) and also fetch the rows from the base table in one slice query (at > least the ones belonging to the same partition). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-5220) Repair improvements when using vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-5220: Labels: performance repair (was: ) > Repair improvements when using vnodes > - > > Key: CASSANDRA-5220 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5220 > Project: Cassandra > Issue Type: Improvement > Components: Core >Affects Versions: 1.2.0 beta 1 >Reporter: Brandon Williams >Assignee: Yuki Morishita > Labels: performance, repair > Fix For: 2.1 beta2 > > > Currently when using vnodes, repair takes much longer to complete than > without them. This appears at least in part because it's using a session per > range and processing them sequentially. This generates a lot of log spam > with vnodes, and while being gentler and lighter on hard disk deployments, > ssd-based deployments would often prefer that repair be as fast as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-7029) Investigate alternative transport protocols for both client and inter-server communications
[ https://issues.apache.org/jira/browse/CASSANDRA-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-7029: --- Assignee: Benedict > Investigate alternative transport protocols for both client and inter-server > communications > --- > > Key: CASSANDRA-7029 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7029 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Labels: performance > Fix For: 3.0 > > > There are a number of reasons to think we can do better than TCP for our > communications: > 1) We can actually tolerate sporadic small message losses, so guaranteed > delivery isn't essential (although for larger messages it probably is) > 2) As shown in \[1\] and \[2\], Linux can behave quite suboptimally with > regard to TCP message delivery when the system is under load. Judging from > the theoretical description, this is likely to apply even when the > system load is not high but the number of processes to schedule is high. > Cassandra generally has a lot of threads to schedule, so this is quite > pertinent for us. UDP performs substantially better here. > 3) Even when the system is not under load, UDP has a lower CPU burden, and > that burden is constant regardless of the number of connections it processes. > 4) On a simple benchmark on my local PC, using non-blocking IO for UDP and > busy spinning on IO, I can actually push 20-40% more throughput through > loopback (where TCP should be optimal, as there is no latency), even for very small > messages. Since we can see networking taking multiple CPUs' worth of time > during a stress test, using a busy-spin for ~100micros after the last message > receipt is almost certainly acceptable, especially as we can (ultimately) > process inter-server and client communications on the same thread/socket in > this model. 
> 5) We can optimise the threading model heavily: since we generally process > very small messages (200 bytes not at all implausible), the thread signalling > costs on the processing thread can actually dramatically impede throughput. > In general it costs ~10micros to signal (and passing the message to another > thread for processing in the current model requires signalling). For 200-byte > messages this caps our throughput at 20MB/s. > I propose to knock up a highly naive UDP-based connection protocol with > super-trivial congestion control over the course of a few days, with the only > initial goal being maximum possible performance (not fairness, reliability, > or anything else), and trial it in Netty (possibly making some changes to > Netty to mitigate thread signalling costs). The reason for knocking up our > own here is to get a ceiling on what the absolute limit of potential for this > approach is. Assuming this pans out with performance gains in C* proper, we > then look to contributing to/forking the udt-java project and see how easy it > is to bring performance in line with what we can get with our naive approach > (I don't suggest starting here, as the project is using blocking old-IO, and > modifying it with latency in mind may be challenging, and we won't know for > sure what the best case scenario is). 
> \[1\] > http://test-docdb.fnal.gov/0016/001648/002/Potential%20Performance%20Bottleneck%20in%20Linux%20TCP.PDF > \[2\] > http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=1968;filename=Performance%20Analysis%20of%20Linux%20Networking%20-%20Packet%20Receiving%20(Official).pdf;version=2 > Further related reading: > http://public.dhe.ibm.com/software/commerce/doc/mft/cdunix/41/UDTWhitepaper.pdf > https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/14482/ChoiUndPerTcp.pdf?sequence=1 > https://access.redhat.com/site/documentation/en-US/JBoss_Enterprise_Web_Platform/5/html/Administration_And_Configuration_Guide/jgroups-perf-udpbuffer.html > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.3762&rep=rep1&type=pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
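The busy-spin receive in point 4 can be sketched with a non-blocking DatagramChannel. This illustrates the receive strategy only, not the proposed protocol; the demo's 1-second spin budget is far more generous than the ~100 micros suggested above, and all names are illustrative.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;

// Sketch: poll a non-blocking UDP channel in a tight loop for a bounded
// window instead of parking the thread and paying the wakeup-signal cost.
public class BusySpinUdp
{
    // Spin on receive() until a datagram arrives or the spin budget expires.
    public static int spinReceive(DatagramChannel ch, ByteBuffer dst, long spinNanos) throws Exception
    {
        long deadline = System.nanoTime() + spinNanos;
        while (System.nanoTime() < deadline)
        {
            if (ch.receive(dst) != null)
                return dst.position(); // bytes received
        }
        return -1; // nothing arrived within the spin window
    }

    // Loopback demo: send one small message to ourselves and spin-receive it.
    public static String loopbackDemo() throws Exception
    {
        try (DatagramChannel ch = DatagramChannel.open())
        {
            ch.bind(new InetSocketAddress("127.0.0.1", 0));
            ch.configureBlocking(false);
            ch.send(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)), ch.getLocalAddress());
            ByteBuffer dst = ByteBuffer.allocate(64);
            if (spinReceive(ch, dst, 1_000_000_000L) < 0)
                return null;
            dst.flip();
            byte[] out = new byte[dst.remaining()];
            dst.get(out);
            return new String(out, StandardCharsets.UTF_8);
        }
    }
}
```

The point of the spin is that for ~200-byte messages the ~10-microsecond signalling cost dominates, so avoiding the park/unpark entirely is what recovers the throughput the comment estimates.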
[jira] [Assigned] (CASSANDRA-5019) Still too much object allocation on reads
[ https://issues.apache.org/jira/browse/CASSANDRA-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-5019: --- Assignee: (was: Benedict) > Still too much object allocation on reads > - > > Key: CASSANDRA-5019 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5019 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jonathan Ellis > Labels: performance > Fix For: 3.0 > > > ArrayBackedSortedColumns was a step in the right direction but it's still > relatively heavyweight thanks to allocating individual Columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-6809) Compressed Commit Log
[ https://issues.apache.org/jira/browse/CASSANDRA-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-6809: --- Assignee: (was: Benedict) > Compressed Commit Log > - > > Key: CASSANDRA-6809 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6809 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Priority: Minor > Labels: performance > Fix For: 3.0 > > > It seems an unnecessary oversight that we don't compress the commit log. > Doing so should improve throughput, but some care will need to be taken to > ensure we use as much of a segment as possible. I propose decoupling the > writing of the records from the segments. Basically write into a (queue of) > DirectByteBuffer, and have the sync thread compress, say, ~64K chunks every X > MB written to the CL (where X is ordinarily CLS size), and then pack as many > of the compressed chunks into a CLS as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
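The chunking scheme in the ticket can be sketched as follows, with java.util.zip's Deflater standing in for whichever codec a real implementation would pick. The [rawLength][compressedLength] framing per chunk is an assumption for illustration, not a proposed on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: carve the written bytes into ~64K chunks, compress each, and pack
// the results contiguously with a small per-chunk header.
public class ChunkedCompressor
{
    static final int CHUNK = 64 * 1024;

    public static byte[] compress(byte[] data) throws Exception
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(out);
        for (int off = 0; off < data.length; off += CHUNK)
        {
            int len = Math.min(CHUNK, data.length - off);
            Deflater deflater = new Deflater();
            deflater.setInput(data, off, len);
            deflater.finish();
            byte[] buf = new byte[len + 64]; // zlib worst case is slightly over the input size
            int clen = 0;
            while (!deflater.finished())
                clen += deflater.deflate(buf, clen, buf.length - clen);
            deflater.end();
            dos.writeInt(len);
            dos.writeInt(clen);
            dos.write(buf, 0, clen);
        }
        dos.flush();
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] packed) throws Exception
    {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (in.available() > 0)
        {
            int len = in.readInt();
            int clen = in.readInt();
            byte[] cbuf = new byte[clen];
            in.readFully(cbuf);
            Inflater inflater = new Inflater();
            inflater.setInput(cbuf);
            byte[] buf = new byte[len];
            int n = 0;
            while (n < len)
                n += inflater.inflate(buf, n, len - n);
            inflater.end();
            out.write(buf, 0, len);
        }
        return out.toByteArray();
    }
}
```

In the ticket's design the sync thread would do the compress() side against the queued DirectByteBuffers, then pack as many compressed chunks as fit into each commit log segment.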
[jira] [Assigned] (CASSANDRA-6861) Eliminate garbage in server-side native transport
[ https://issues.apache.org/jira/browse/CASSANDRA-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-6861: --- Assignee: (was: Benedict) > Eliminate garbage in server-side native transport > - > > Key: CASSANDRA-6861 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6861 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1 beta2 > > > Now we've upgraded to Netty 4, we're generating a lot of garbage that could > be avoided, so we should probably stop that. Should be reasonably easy to > hook into Netty's pooled buffers, returning them to the pool once a given > message is completed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-6726) Recycle CRAR/RAR buffers independently of their owners, and move them off-heap when possible
[ https://issues.apache.org/jira/browse/CASSANDRA-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-6726: --- Assignee: (was: Benedict) > Recycle CRAR/RAR buffers independently of their owners, and move them > off-heap when possible > > > Key: CASSANDRA-6726 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6726 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Priority: Minor > Labels: performance > Fix For: 3.0 > > > Whilst CRAR and RAR are pooled, we could and probably should pool the buffers > independently, so that they are not tied to a specific sstable. It may be > possible to move the RAR buffer off-heap, and the CRAR sometimes (e.g. Snappy > may possibly support off-heap buffers) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-6755) Optimise CellName/Composite comparisons for NativeCell
[ https://issues.apache.org/jira/browse/CASSANDRA-6755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-6755: --- Assignee: (was: Benedict) > Optimise CellName/Composite comparisons for NativeCell > -- > > Key: CASSANDRA-6755 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6755 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Priority: Minor > Labels: performance > Fix For: 3.0 > > > As discussed in CASSANDRA-6694, to reduce temporary garbage generation we > should minimise the incidence of CellName component extraction. The biggest > win will be to perform comparisons on Cell where possible, instead of > CellName, so that Native*Cell can use its extra information to avoid creating > any ByteBuffer objects -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-6976: --- Assignee: (was: Benedict) > Determining replicas to query is very slow with large numbers of nodes or > vnodes > > > Key: CASSANDRA-6976 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6976 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict > Labels: performance > Fix For: 2.1 > > > As described in CASSANDRA-6906, this can be ~100ms for a relatively small > cluster with vnodes, which is longer than it will spend in transit on the > network. This should be much faster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-6935) Make clustering part of primary key a first order component in the storage engine
[ https://issues.apache.org/jira/browse/CASSANDRA-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-6935: --- Assignee: (was: Benedict) > Make clustering part of primary key a first order component in the storage > engine > - > > Key: CASSANDRA-6935 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6935 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict > Labels: performance > Fix For: 3.0 > > > It would be helpful for a number of upcoming improvements if the clustering > part of the primary key were extracted from CellName, and if a ColumnFamily > object could store multiple ClusteredRow (or similar) instances, within which > each cell is keyed only by the column identifier. > This would also, by itself, reduce comparison costs and also permit memory > savings in memtables, by sharing the clustering part of the primary key > across all cells in the same row. It might also make it easier to move more > data off-heap, by constructing an off-heap clustered row, but keeping the > partition level object on-heap. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-6936) Make all byte representations of types comparable by their unsigned byte representation only
[ https://issues.apache.org/jira/browse/CASSANDRA-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-6936: --- Assignee: (was: Benedict) > Make all byte representations of types comparable by their unsigned byte > representation only > > > Key: CASSANDRA-6936 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6936 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict > Labels: performance > Fix For: 3.0 > > > This could be a painful change, but is necessary for implementing a > trie-based index, and settling for less would be suboptimal; it also should > make comparisons cheaper all-round, and since comparison operations are > pretty much the majority of C*'s business, this should be easily felt (see > CASSANDRA-6553 and CASSANDRA-6934 for an example of some minor changes with > major performance impacts). No copying/special casing/slicing should mean > fewer opportunities to introduce performance regressions as well. > Since I have slated for 3.0 a lot of non-backwards-compatible sstable > changes, hopefully this shouldn't be too much more of a burden. -- This message was sent by Atlassian JIRA (v6.2#6252)
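The property CASSANDRA-6936 asks for — every serialized type ordered by its raw bytes alone — can be sketched as follows. This is a minimal illustration, not Cassandra's implementation; the class name is hypothetical. The key detail is masking each byte to its unsigned value so that 0x80–0xFF sort after 0x00–0x7F, with no type-specific decoding anywhere in the comparison:

```java
import java.nio.ByteBuffer;

public final class UnsignedBytes {
    // Lexicographic comparison by unsigned byte value only, with no
    // type-specific decoding -- the invariant every serialized type
    // would need to satisfy for a trie-based index to work.
    public static int compareUnsigned(ByteBuffer a, ByteBuffer b) {
        int minLen = Math.min(a.remaining(), b.remaining());
        for (int i = 0; i < minLen; i++) {
            // mask to 0..255 so bytes 0x80..0xFF sort after 0x00..0x7F
            int x = a.get(a.position() + i) & 0xFF;
            int y = b.get(b.position() + i) & 0xFF;
            if (x != y)
                return x < y ? -1 : 1;
        }
        // shared prefix: the shorter buffer sorts first
        return Integer.compare(a.remaining(), b.remaining());
    }
}
```

Because the loop never slices, copies, or special-cases, it is also the kind of comparison that avoids the temporary-object churn discussed in CASSANDRA-6694.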
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970143#comment-13970143 ] Aleksey Yeschenko commented on CASSANDRA-6949: -- Probably talking about this - https://github.com/apache/cassandra/blob/2804ce9945a83a696e36b4add7a684b132fdef7c/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java#L226-L230 > Performance regression in tombstone heavy workloads > --- > > Key: CASSANDRA-6949 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6949 > Project: Cassandra > Issue Type: Bug >Reporter: Jeremiah Jordan >Assignee: Sam Tunnicliffe > Attachments: > 0001-Remove-expansion-of-RangeTombstones-to-delete-from-2.patch, 6949.txt > > > CASSANDRA-5614 causes a huge performance regression in tombstone heavy > workloads. The isDeleted checks here cause a huge CPU overhead: > https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/db/AtomicSortedColumns.java#L189-L196 > An insert workload which does perfectly fine on 1.2 pegs CPU use at 100% on > 2.0, with all of the mutation threads sitting in that loop. 
For example:
> {noformat}
> "MutationStage:20" daemon prio=10 tid=0x7fb1c4c72800 nid=0x2249 runnable [0x7fb1b033]
>java.lang.Thread.State: RUNNABLE
> at org.apache.cassandra.db.marshal.BytesType.bytesCompare(BytesType.java:45)
> at org.apache.cassandra.db.marshal.UTF8Type.compare(UTF8Type.java:34)
> at org.apache.cassandra.db.marshal.UTF8Type.compare(UTF8Type.java:26)
> at org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:267)
> at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:85)
> at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:35)
> at org.apache.cassandra.db.RangeTombstoneList.searchInternal(RangeTombstoneList.java:253)
> at org.apache.cassandra.db.RangeTombstoneList.isDeleted(RangeTombstoneList.java:210)
> at org.apache.cassandra.db.DeletionInfo.isDeleted(DeletionInfo.java:136)
> at org.apache.cassandra.db.DeletionInfo.isDeleted(DeletionInfo.java:123)
> at org.apache.cassandra.db.AtomicSortedColumns.addAllWithSizeDelta(AtomicSortedColumns.java:193)
> at org.apache.cassandra.db.Memtable.resolve(Memtable.java:194)
> at org.apache.cassandra.db.Memtable.put(Memtable.java:158)
> at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:890)
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:368)
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:333)
> at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:201)
> at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
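The hot loop in the stack trace boils down to this pattern: every cell applied to the memtable does a binary search over the partition's sorted range tombstones, paying O(log n) comparator calls per mutation even when nothing is actually deleted. A minimal sketch of that shape (all names hypothetical; integers stand in for composite cell names):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-cell deletion check the stack trace shows: a binary
// search over disjoint, sorted tombstone ranges for every inserted cell.
// In the real code each comparison walks AbstractCompositeType, which is
// where the CPU time goes under tombstone-heavy insert workloads.
final class RangeTombstoneSketch {
    static final class Range {
        final int start, end; // stand-ins for composite-name bounds
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    final List<Range> sorted = new ArrayList<>(); // kept sorted by start, non-overlapping

    boolean isDeleted(int cellName) {
        int lo = 0, hi = sorted.size() - 1;
        while (lo <= hi) {                 // analogous to searchInternal()
            int mid = (lo + hi) >>> 1;
            Range r = sorted.get(mid);
            if (cellName < r.start) hi = mid - 1;
            else if (cellName > r.end) lo = mid + 1;
            else return true;              // cell falls inside a tombstone range
        }
        return false;
    }
}
```

With many tombstone ranges and a high mutation rate, this check dominates, which is why the attached patch avoids expanding RangeTombstones into per-cell deletes in the first place.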
[jira] [Resolved] (CASSANDRA-7024) Create snapshot selectively during sequential repair
[ https://issues.apache.org/jira/browse/CASSANDRA-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuki Morishita resolved CASSANDRA-7024. --- Resolution: Fixed Thanks, committed. And yes, it looks like SnapshotCommand is not used any more, but I'll leave it in for now. > Create snapshot selectively during sequential repair > - > > Key: CASSANDRA-7024 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7024 > Project: Cassandra > Issue Type: Improvement >Reporter: Yuki Morishita >Assignee: Yuki Morishita >Priority: Minor > Fix For: 2.1 beta2 > > Attachments: > 0001-Only-snapshot-SSTables-related-to-validating-range.patch > > > When doing snapshot repair, right now we snapshot all SSTables, open them and > use just part of them to build the MerkleTree. > Instead, we can snapshot and use only the SSTables that are necessary to build > the MerkleTree of the range of interest. -- This message was sent by Atlassian JIRA (v6.2#6252)
[2/3] git commit: Snapshot only related SSTables when sequential repair
Snapshot only related SSTables when sequential repair patch by yukim; reviewed by jmckenzie for CASSANDRA-7024 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/de8a479f Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/de8a479f Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/de8a479f Branch: refs/heads/trunk Commit: de8a479f2e1a8b536dedf2e6470301709bc3d9dc Parents: b69f5e3 Author: Yuki Morishita Authored: Tue Apr 15 17:13:45 2014 -0500 Committer: Yuki Morishita Committed: Tue Apr 15 17:13:45 2014 -0500 -- CHANGES.txt | 1 + .../apache/cassandra/db/ColumnFamilyStore.java | 18 ++- .../repair/RepairMessageVerbHandler.java| 33 +--- .../apache/cassandra/repair/SnapshotTask.java | 8 +-- .../repair/messages/RepairMessage.java | 3 +- .../repair/messages/SnapshotMessage.java| 53 6 files changed, 100 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/de8a479f/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 592eef9..9f34023 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -45,6 +45,7 @@ * Add failure handler to async callback (CASSANDRA-6747) * Fix AE when closing SSTable without releasing reference (CASSANDRA-7000) * Clean up IndexInfo on keyspace/table drops (CASSANDRA-6924) + * Only snapshot relative SSTables when sequential repair (CASSANDRA-7024) Merged from 2.0: * Put nodes in hibernate when join_ring is false (CASSANDRA-6961) * Allow compaction of system tables during startup (CASSANDRA-6913) http://git-wip-us.apache.org/repos/asf/cassandra/blob/de8a479f/src/java/org/apache/cassandra/db/ColumnFamilyStore.java -- diff --git a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java index ffea243..923ea5b 100644 --- a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java +++ b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java @@ -30,6 +30,7 @@ import 
javax.management.*; import com.google.common.annotations.VisibleForTesting; import com.google.common.base.Function; +import com.google.common.base.Predicate; import com.google.common.collect.*; import com.google.common.util.concurrent.*; import com.google.common.util.concurrent.Futures; @@ -2153,6 +2154,11 @@ public class ColumnFamilyStore implements ColumnFamilyStoreMBean public void snapshotWithoutFlush(String snapshotName) { +snapshotWithoutFlush(snapshotName, null); +} + +public void snapshotWithoutFlush(String snapshotName, Predicate predicate) +{ for (ColumnFamilyStore cfs : concatWithIndexes()) { DataTracker.View currentView = cfs.markCurrentViewReferenced(); @@ -2161,6 +2167,11 @@ public class ColumnFamilyStore implements ColumnFamilyStoreMBean { for (SSTableReader ssTable : currentView.sstables) { +if (predicate != null && !predicate.apply(ssTable)) +{ +continue; +} + File snapshotDirectory = Directories.getSnapshotDirectory(ssTable.descriptor, snapshotName); ssTable.createLinks(snapshotDirectory.getPath()); // hard links if (logger.isDebugEnabled()) @@ -2190,8 +2201,13 @@ public class ColumnFamilyStore implements ColumnFamilyStoreMBean */ public void snapshot(String snapshotName) { +snapshot(snapshotName, null); +} + +public void snapshot(String snapshotName, Predicate predicate) +{ forceBlockingFlush(); -snapshotWithoutFlush(snapshotName); +snapshotWithoutFlush(snapshotName, predicate); } public boolean snapshotExists(String snapshotName) http://git-wip-us.apache.org/repos/asf/cassandra/blob/de8a479f/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java -- diff --git a/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java b/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java index bb66b69..d710652 100644 --- a/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java +++ b/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java @@ -18,30 +18,32 @@ package org.apache.cassandra.repair; import 
java.util.ArrayList; +import java.util.Collections; import java.util.List; import java.util.UUID; import java.util.concurrent.Future; +import com.google.common.base.Predicate; +import org.slf4j.
[1/3] git commit: Snapshot only related SSTables when sequential repair
Repository: cassandra Updated Branches: refs/heads/cassandra-2.1 b69f5e363 -> de8a479f2 refs/heads/trunk fc4ae115a -> 2804ce994 Snapshot only related SSTables when sequential repair patch by yukim; reviewed by jmckenzie for CASSANDRA-7024 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/de8a479f Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/de8a479f Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/de8a479f Branch: refs/heads/cassandra-2.1 Commit: de8a479f2e1a8b536dedf2e6470301709bc3d9dc Parents: b69f5e3 Author: Yuki Morishita Authored: Tue Apr 15 17:13:45 2014 -0500 Committer: Yuki Morishita Committed: Tue Apr 15 17:13:45 2014 -0500 -- CHANGES.txt | 1 + .../apache/cassandra/db/ColumnFamilyStore.java | 18 ++- .../repair/RepairMessageVerbHandler.java| 33 +--- .../apache/cassandra/repair/SnapshotTask.java | 8 +-- .../repair/messages/RepairMessage.java | 3 +- .../repair/messages/SnapshotMessage.java| 53 6 files changed, 100 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/de8a479f/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 592eef9..9f34023 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -45,6 +45,7 @@ * Add failure handler to async callback (CASSANDRA-6747) * Fix AE when closing SSTable without releasing reference (CASSANDRA-7000) * Clean up IndexInfo on keyspace/table drops (CASSANDRA-6924) + * Only snapshot relative SSTables when sequential repair (CASSANDRA-7024) Merged from 2.0: * Put nodes in hibernate when join_ring is false (CASSANDRA-6961) * Allow compaction of system tables during startup (CASSANDRA-6913) http://git-wip-us.apache.org/repos/asf/cassandra/blob/de8a479f/src/java/org/apache/cassandra/db/ColumnFamilyStore.java -- diff --git a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java index ffea243..923ea5b 100644 --- 
a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java +++ b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java @@ -30,6 +30,7 @@ import javax.management.*; import com.google.common.annotations.VisibleForTesting; import com.google.common.base.Function; +import com.google.common.base.Predicate; import com.google.common.collect.*; import com.google.common.util.concurrent.*; import com.google.common.util.concurrent.Futures; @@ -2153,6 +2154,11 @@ public class ColumnFamilyStore implements ColumnFamilyStoreMBean public void snapshotWithoutFlush(String snapshotName) { +snapshotWithoutFlush(snapshotName, null); +} + +public void snapshotWithoutFlush(String snapshotName, Predicate predicate) +{ for (ColumnFamilyStore cfs : concatWithIndexes()) { DataTracker.View currentView = cfs.markCurrentViewReferenced(); @@ -2161,6 +2167,11 @@ public class ColumnFamilyStore implements ColumnFamilyStoreMBean { for (SSTableReader ssTable : currentView.sstables) { +if (predicate != null && !predicate.apply(ssTable)) +{ +continue; +} + File snapshotDirectory = Directories.getSnapshotDirectory(ssTable.descriptor, snapshotName); ssTable.createLinks(snapshotDirectory.getPath()); // hard links if (logger.isDebugEnabled()) @@ -2190,8 +2201,13 @@ public class ColumnFamilyStore implements ColumnFamilyStoreMBean */ public void snapshot(String snapshotName) { +snapshot(snapshotName, null); +} + +public void snapshot(String snapshotName, Predicate predicate) +{ forceBlockingFlush(); -snapshotWithoutFlush(snapshotName); +snapshotWithoutFlush(snapshotName, predicate); } public boolean snapshotExists(String snapshotName) http://git-wip-us.apache.org/repos/asf/cassandra/blob/de8a479f/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java -- diff --git a/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java b/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java index bb66b69..d710652 100644 --- 
a/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java +++ b/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java @@ -18,30 +18,32 @@ package org.apache.cassandra.repair; import java.util.ArrayList; +import java.util.Collections; impo
[3/3] git commit: Merge branch 'cassandra-2.1' into trunk
Merge branch 'cassandra-2.1' into trunk Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/2804ce99 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/2804ce99 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/2804ce99 Branch: refs/heads/trunk Commit: 2804ce9945a83a696e36b4add7a684b132fdef7c Parents: fc4ae11 de8a479 Author: Yuki Morishita Authored: Tue Apr 15 17:15:01 2014 -0500 Committer: Yuki Morishita Committed: Tue Apr 15 17:15:01 2014 -0500 -- CHANGES.txt | 1 + .../apache/cassandra/db/ColumnFamilyStore.java | 18 ++- .../repair/RepairMessageVerbHandler.java| 33 +--- .../apache/cassandra/repair/SnapshotTask.java | 8 +-- .../repair/messages/RepairMessage.java | 3 +- .../repair/messages/SnapshotMessage.java| 53 6 files changed, 100 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/2804ce99/CHANGES.txt --
[jira] [Commented] (CASSANDRA-6572) Workload recording / playback
[ https://issues.apache.org/jira/browse/CASSANDRA-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970130#comment-13970130 ] Tyler Hobbs commented on CASSANDRA-6572: It's a pretty safe patch, but as a non-essential feature I think it should be reserved for 2.1 or 3.0. > Workload recording / playback > - > > Key: CASSANDRA-6572 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6572 > Project: Cassandra > Issue Type: New Feature > Components: Core, Tools >Reporter: Jonathan Ellis >Assignee: Lyuben Todorov > Fix For: 2.0.8 > > Attachments: 6572-trunk.diff > > > "Write sample mode" gets us part way to testing new versions against a real > world workload, but we need an easy way to test the query side as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7043) CommitLogArchiver thread pool name inconsistent with others
Chris Lohfink created CASSANDRA-7043: Summary: CommitLogArchiver thread pool name inconsistent with others Key: CASSANDRA-7043 URL: https://issues.apache.org/jira/browse/CASSANDRA-7043 Project: Cassandra Issue Type: Bug Components: Core Reporter: Chris Lohfink Priority: Trivial Attachments: namechange.diff Pretty trivial... The names of all ThreadPoolExecutors are in CamelCase except the CommitLogArchiver as commitlog_archiver. This shows up a little more obviously in tpstats output:
{code}
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked
ReadStage                         0         0         113702         0
RequestResponseStage              0         0              0         0
...
PendingRangeCalculator            0         0              1         0
commitlog_archiver                0         0              0         0
InternalResponseStage             0         0              0         0
HintedHandoff                     0         0              0         0
{code}
Seems minor enough to update this to be CommitLogArchiver but it may mean changes in any monitoring applications (although I don't think this particular pool has had much runtime or monitoring needs). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-5863) In process (uncompressed) page cache
[ https://issues.apache.org/jira/browse/CASSANDRA-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969927#comment-13969927 ] Benedict commented on CASSANDRA-5863: - I think there are at least three issues we're contending with here, and each needs its own ticket (eventually). Putting historic data on slow drives is, I think, a different problem to putting a cache on some fast disks. Both will be helpful. Ideally I think we want the following tiers:
# Uncompressed Memory Cache
# Compressed Memory Cache (disjoint set from 1)
# Compressed SSD cache
# Regular Data
# Archived/Cold/Historic Data
The main distinction being the added "regular data" layer: any special "fast disk" cache should not store the full sstable hierarchy and its related files; it should just store the most popular blocks (or portions of blocks). bq. Benedict you are describing building a custom page cache impl off heap which is pretty ambitious. Don't you think a baby step would be to rely on the OS page cache to start and build a custom one as a phase II? People get very worried when they think they're competing with the kernel developers. Often for good reason, but since we don't have to be all things to all people we get the opportunity to make economies that aren't always as easily available to them. But also we only need to get roughly the same performance so we can build on this to make inroads elsewhere. What we're talking about here is pretty straightforward - it's one of the less challenging problems. A compressed page cache is more challenging, since we don't have a uniform size, but it is still probably not too difficult. Take a look at my suggestion for a key cache in CASSANDRA-6709 for a detailed description of how I would build the offheap structure. The basic approach I would probably take is this: deal with 4KB blocks. Any blocks we read from disk larger than this we split up into 4KB chunks and insert each into the cache separately*.
The cache itself is 8- or 16-way associative, with three components: a long storing the LRU information for the bucket, 16 longs storing identity information for the lookup within the bucket, and corresponding positions in a large address space storing each of the 4KB data chunks. Readers always hit the cache, and if they miss they populate the cache using the appropriate reader before continuing. Regrettably we don't have access to SIMD instructions, or we could do a lot of this stuff tremendously efficiently, but even without that it should be pretty nippy. *This allows us to have a greater granularity for eviction and keeps CPU-cache traffic when reading from the cache to a minimum. It's also a pretty optimal size for reading/writing to SSD if we overflow to disk, and is a sufficiently large amount to get good compression for an in-memory compressed cache, whilst still being small enough to stream and decompress from main memory without a major penalty when looking up a small part of it. As to having a fast disk cache, I also think this is a great idea. But I think it fits in as an extension of this and any compressed in-memory cache, as we build a tiered-cache architecture. > In process (uncompressed) page cache > > > Key: CASSANDRA-5863 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5863 > Project: Cassandra > Issue Type: New Feature > Components: Core >Reporter: T Jake Luciani >Assignee: Pavel Yaskevich > Labels: performance > Fix For: 2.1 beta2 > > > Currently, for every read, the CRAR reads each compressed chunk into a > byte[], sends it to ICompressor, gets back another byte[] and verifies a > checksum. > This process is where the majority of time is spent in a read request. > Before compression, we would have zero-copy of data and could respond > directly from the page-cache. > It would be useful to have some kind of Chunk cache that could speed up this > process for hot data. 
Initially this could be a off heap cache but it would > be great to put these decompressed chunks onto a SSD so the hot data lives on > a fast disk similar to https://github.com/facebook/flashcache. -- This message was sent by Atlassian JIRA (v6.2#6252)
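The bucket layout Benedict describes in the comment above (one LRU word, 16 identity longs, and 16 corresponding 4KB slots per bucket) can be sketched on-heap as follows. All names, the hash function, and the placeholder replacement policy are assumptions for illustration; the real design would live off-heap and track LRU ranks in the per-bucket word:

```java
// Hedged sketch of a 16-way set-associative chunk cache: each bucket has
// one long of LRU state, 16 identity longs (e.g. sstable id + chunk
// offset packed into a long), and 16 matching 4KB slots in one backing
// array. Lookup is a linear scan of the bucket's identity words.
final class AssociativeChunkCache {
    static final int WAYS = 16;
    static final int CHUNK = 4096;
    static final long EMPTY = -1L;

    final int buckets;
    final long[] lru;        // one LRU word per bucket
    final long[] identity;   // buckets * WAYS identity longs
    final byte[] data;       // buckets * WAYS * CHUNK bytes of chunk storage

    AssociativeChunkCache(int buckets) {
        this.buckets = buckets;
        this.lru = new long[buckets];
        this.identity = new long[buckets * WAYS];
        this.data = new byte[buckets * WAYS * CHUNK];
        java.util.Arrays.fill(identity, EMPTY);
    }

    /** Returns the byte offset of the cached 4KB chunk in data[], or -1 on a miss. */
    int lookup(long key) {
        int base = bucketOf(key) * WAYS;
        for (int way = 0; way < WAYS; way++)
            if (identity[base + way] == key)
                return (base + way) * CHUNK; // hit: read data[offset .. offset+4096)
        return -1;
    }

    /** Naive insert: evict the way named by the low bits of the LRU word. */
    int insert(long key) {
        int bucket = bucketOf(key);
        int victim = (int) (lru[bucket] & (WAYS - 1));
        lru[bucket]++; // placeholder policy; a real LRU would reorder per-way ranks
        identity[bucket * WAYS + victim] = key;
        return (bucket * WAYS + victim) * CHUNK;
    }

    private int bucketOf(long key) {
        return (Long.hashCode(key) & 0x7fffffff) % buckets;
    }
}
```

On a miss, the reader would populate the returned slot from disk before continuing, exactly as the comment describes; splitting reads into fixed 4KB chunks is what makes the fixed-slot layout possible.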
[jira] [Commented] (CASSANDRA-7034) commitlog files are 32MB in size, even with a 64bit OS and jvm
[ https://issues.apache.org/jira/browse/CASSANDRA-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969891#comment-13969891 ] Benedict commented on CASSANDRA-7034: - You are correct that your files are 32MB in size. On all JVMs they should be 32MB in size, and there should be at most 32 of them on a 64-bit architecture, except when the data directories are behind the commit log, in which case there can be more. On a 32-bit architecture there would be only 1 commit log file. > commitlog files are 32MB in size, even with a 64bit OS and jvm > --- > > Key: CASSANDRA-7034 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7034 > Project: Cassandra > Issue Type: Bug >Reporter: Donald Smith > > We did a rpm install of cassandra 2.0.6 on CentOS 6.4 running > {noformat} > > java -version > Java(TM) SE Runtime Environment (build 1.7.0_40-b43) > Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) > {noformat} > That is the version of java CassandraDaemon is using. > We used the default setting (None) in cassandra.yaml for > commitlog_total_space_in_mb: > {noformat} > # Total space to use for commitlogs. Since commitlog segments are > # mmapped, and hence use up address space, the default size is 32 > # on 32-bit JVMs, and 1024 on 64-bit JVMs. > # > # If space gets above this value (it will round up to the next nearest > # segment multiple), Cassandra will flush every dirty CF in the oldest > # segment and remove it. So a small total commitlog space will tend > # to cause more flush activity on less-active columnfamilies. > # commitlog_total_space_in_mb: 4096 > {noformat} > But our commitlog files are 32MB in size, not 1024MB. > OpsCenter confirms that commitlog_total_space_in_mb is None. 
> I don't think the problem is in cassandra-env.sh, because when I run it > manually and echo the values of the version variables I get: > {noformat} > jvmver=1.7.0_40 > JVM_VERSION=1.7.0 > JVM_ARCH=64-Bit > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
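The arithmetic behind Benedict's comment can be made explicit. This is a sketch of the documented defaults only (class and method names hypothetical, values taken from the yaml comment quoted above, not read from any running node): the individual segment *file* is always 32MB, and the 32-bit/64-bit distinction only changes the *total* space, i.e. how many 32MB segments may exist before the oldest is recycled.

```java
// Sketch of the default commitlog sizing: segment files are a fixed
// 32MB; total_space / segment_size gives the maximum segment count.
final class CommitLogMath {
    static final int SEGMENT_SIZE_MB = 32;

    static int maxSegments(boolean is64Bit) {
        // documented defaults for commitlog_total_space_in_mb
        int totalSpaceMb = is64Bit ? 1024 : 32;
        return totalSpaceMb / SEGMENT_SIZE_MB;
    }
}
```

So on a 64-bit JVM the reporter should expect up to 32 files of 32MB each (1024MB total), and on a 32-bit JVM a single 32MB file; seeing 32MB files is therefore consistent with the defaults.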
[jira] [Commented] (CASSANDRA-7034) commitlog files are 32MB in size, even with a 64bit OS and jvm
[ https://issues.apache.org/jira/browse/CASSANDRA-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969886#comment-13969886 ] Donald Smith commented on CASSANDRA-7034: - Benedict, I'm aware that *commitlog_total_space_in_mb* has that purpose. What I'm raising is the issue that this comment in cassandra.yaml is now wrong: "the default size is 32 on 32-bit JVMs, and 1024 on 64-bit JVMs." That's no longer being enforced. > commitlog files are 32MB in size, even with a 64bit OS and jvm > --- > > Key: CASSANDRA-7034 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7034 > Project: Cassandra > Issue Type: Bug >Reporter: Donald Smith > > We did a rpm install of cassandra 2.0.6 on CentOS 6.4 running > {noformat} > > java -version > Java(TM) SE Runtime Environment (build 1.7.0_40-b43) > Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) > {noformat} > That is the version of java CassandraDaemon is using. > We used the default setting (None) in cassandra.yaml for > commitlog_total_space_in_mb: > {noformat} > # Total space to use for commitlogs. Since commitlog segments are > # mmapped, and hence use up address space, the default size is 32 > # on 32-bit JVMs, and 1024 on 64-bit JVMs. > # > # If space gets above this value (it will round up to the next nearest > # segment multiple), Cassandra will flush every dirty CF in the oldest > # segment and remove it. So a small total commitlog space will tend > # to cause more flush activity on less-active columnfamilies. > # commitlog_total_space_in_mb: 4096 > {noformat} > But our commitlog files are 32MB in size, not 1024MB. > OpsCenter confirms that commitlog_total_space_in_mb is None. > I don't think the problem is in cassandra-env.sh, because when I run it > manually and echo the values of the version variables I get: > {noformat} > jvmver=1.7.0_40 > JVM_VERSION=1.7.0 > JVM_ARCH=64-Bit > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7040) Replace read/write stage with per-disk access coordination
[ https://issues.apache.org/jira/browse/CASSANDRA-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969882#comment-13969882 ] Benedict commented on CASSANDRA-7040: - bq. I don't think that's necessarily blocked by this work Sure - and if you want to start building one right now, go to town :) I only mean that I think it builds on the work here and in 5863, as they both involve intercepting the points at which we perform disk accesses and inserting some (minimal) coordination in between them. Swapping those interception points for something more intelligent is probably more straightforward once we've done that, and having a cache in which to deposit the result is _probably_ helpful too (definitely none of this is 100% essential though). > Replace read/write stage with per-disk access coordination > -- > > Key: CASSANDRA-7040 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7040 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict > Labels: performance > Fix For: 3.0 > > > As discussed in CASSANDRA-6995, current coordination of access to disk is > suboptimal: instead of ensuring disk accesses alone are coordinated, we > instead coordinate at the level of operations that may touch the disks, > ensuring only so many are proceeding at once. As such, tuning is difficult, > and we incur unnecessary delays for operations that would not touch the > disk(s). > Ideally we would instead simply use a shared coordination primitive to gate > access to the disk when we perform a rebuffer. This work would dovetail very > nicely with any work in CASSANDRA-5863, as we could prevent any blocking or > context switching for data that we know to be cached. It also, as far as I > can tell, obviates the need for CASSANDRA-5239. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6487) Log WARN on large batch sizes
[ https://issues.apache.org/jira/browse/CASSANDRA-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969880#comment-13969880 ] Benedict commented on CASSANDRA-6487: - I suggest using the ColumnFamily.dataSize() method as Aleksey suggested: in the BatchStatement.executeWithConditions() and executeWithoutConditions() methods we have access to the fully constructed ColumnFamily objects we will apply. In the former we construct a single CF _updates_, and in the latter we can iterate over each of the IMutations and call _getColumnFamilies()_. Warning on the prepared size is probably not meaningful, because it does not say anything about how big the data we're applying is. > Log WARN on large batch sizes > - > > Key: CASSANDRA-6487 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6487 > Project: Cassandra > Issue Type: Improvement >Reporter: Patrick McFadin >Assignee: Lyuben Todorov >Priority: Minor > Fix For: 2.0.8 > > Attachments: 6487_trunk.patch, 6487_trunk_v2.patch, > cassandra-2.0-6487.diff > > > Large batches on a coordinator can cause a lot of node stress. I propose > adding a WARN log entry if batch sizes go beyond a configurable size. This > will give more visibility to operators on something that can happen on the > developer side. > New yaml setting with 5k default. > {{# Log WARN on any batch size exceeding this value. 5k by default.}} > {{# Caution should be taken on increasing the size of this threshold as it > can lead to node instability.}} > {{batch_size_warn_threshold: 5k}} -- This message was sent by Atlassian JIRA (v6.2#6252)
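The measurement Benedict suggests — summing the serialized data size of the fully constructed ColumnFamily objects a batch will apply, rather than the prepared statement's size — can be sketched as follows. Interfaces are simplified and hypothetical; `dataSize()` stands in for `ColumnFamily.dataSize()`, and the caller would emit the actual WARN log line:

```java
import java.util.List;

// Hedged sketch of the proposed batch-size check: sum dataSize() over
// every column family the batch touches and compare against the
// configurable threshold (5KB by default in the proposal).
final class BatchSizeWarner {
    interface SizedColumnFamily { long dataSize(); }

    static boolean shouldWarn(List<? extends SizedColumnFamily> cfs, long thresholdBytes) {
        long total = 0;
        for (SizedColumnFamily cf : cfs)
            total += cf.dataSize();     // actual applied data, not prepared size
        return total > thresholdBytes;  // caller logs the WARN with the total
    }
}
```

For the conditional path this would be called with the single `updates` ColumnFamily; for the unconditional path, with the column families gathered from each IMutation's `getColumnFamilies()`.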
[jira] [Updated] (CASSANDRA-6572) Workload recording / playback
[ https://issues.apache.org/jira/browse/CASSANDRA-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-6572: -- Reviewer: Tyler Hobbs WDYT [~thobbs], is this uninvasive enough to make it into 2.0? > Workload recording / playback > - > > Key: CASSANDRA-6572 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6572 > Project: Cassandra > Issue Type: New Feature > Components: Core, Tools >Reporter: Jonathan Ellis >Assignee: Lyuben Todorov > Fix For: 2.0.8 > > Attachments: 6572-trunk.diff > > > "Write sample mode" gets us part way to testing new versions against a real > world workload, but we need an easy way to test the query side as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6572) Workload recording / playback
[ https://issues.apache.org/jira/browse/CASSANDRA-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969839#comment-13969839 ] Jonathan Ellis commented on CASSANDRA-6572: --- How do you deal w/ prepared vs non-prepared queries? Thinking of CASSANDRA-7021 here. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-5863) In process (uncompressed) page cache
[ https://issues.apache.org/jira/browse/CASSANDRA-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969831#comment-13969831 ] T Jake Luciani commented on CASSANDRA-5863: --- I do think having a set of fast disks for hot data that doesn't fit into memory is key, because in a large per-node deployment you want: 1. Memory (really hot data) 2. SSD (hot data that doesn't fit in memory) 3. Spinning disk (historic cold data) [~benedict] you are describing building a custom page cache impl off-heap, which is pretty ambitious. Don't you think a baby step would be to rely on the OS page cache to start, and build a custom one as a phase II? What would the page size be for uncompressed data? For compressed data the chunk size (conceptually) fits nicely. > In process (uncompressed) page cache > > > Key: CASSANDRA-5863 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5863 > Project: Cassandra > Issue Type: New Feature > Components: Core >Reporter: T Jake Luciani >Assignee: Pavel Yaskevich > Labels: performance > Fix For: 2.1 beta2 > > > Currently, for every read, the CRAR reads each compressed chunk into a > byte[], sends it to ICompressor, gets back another byte[] and verifies a > checksum. > This process is where the majority of time is spent in a read request. > Before compression, we would have zero-copy of data and could respond > directly from the page-cache. > It would be useful to have some kind of Chunk cache that could speed up this > process for hot data. Initially this could be an off-heap cache, but it would > be great to put these decompressed chunks onto an SSD so the hot data lives on > a fast disk, similar to https://github.com/facebook/flashcache. -- This message was sent by Atlassian JIRA (v6.2#6252)
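For illustration, a minimal in-process chunk cache along the lines discussed: an LRU map from (file, chunk offset) to decompressed bytes. This is a hypothetical on-heap sketch only; the actual proposal involves off-heap memory and SSD tiers, which this deliberately ignores:

```java
// Hypothetical sketch: cache decompressed chunks keyed by (file, chunk offset),
// evicting the least-recently-used entry once the cache is full.
import java.util.LinkedHashMap;
import java.util.Map;

public class ChunkCache {
    private final int maxChunks;
    private final LinkedHashMap<String, byte[]> cache;

    public ChunkCache(int maxChunks) {
        this.maxChunks = maxChunks;
        // accessOrder=true makes iteration order least-recently-accessed first,
        // so removeEldestEntry implements LRU eviction.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > ChunkCache.this.maxChunks;
            }
        };
    }

    /** Returns the cached decompressed chunk, or null on a miss. */
    public synchronized byte[] get(String file, long chunkOffset) {
        return cache.get(file + "@" + chunkOffset);
    }

    public synchronized void put(String file, long chunkOffset, byte[] decompressed) {
        cache.put(file + "@" + chunkOffset, decompressed);
    }
}
```

On a hit, the read path could skip the decompress-and-checksum work entirely, which is where the ticket says most read time goes.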
[jira] [Commented] (CASSANDRA-6985) ReadExecutors should not rely on static StorageProxy
[ https://issues.apache.org/jira/browse/CASSANDRA-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969798#comment-13969798 ] Yuki Morishita commented on CASSANDRA-6985: --- Ed, I don't see the reason to pass StorageProxy to AbstractReadExecutor at all. It is only used to get the live sorted endpoints in getExecutor, so why not just pass a List? As far as I can see, the only reason the StorageProxy singleton instance exists right now is for JMX. Is it more reasonable (for now?) to leave StorageProxy as a utility class/API and separate its management aspect into another class? > ReadExecutors should not rely on static StorageProxy > > > Key: CASSANDRA-6985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6985 > Project: Cassandra > Issue Type: Sub-task >Reporter: Edward Capriolo >Assignee: Edward Capriolo >Priority: Minor > Fix For: 3.0 > > Attachments: CASSANDRA_6985.1.patch > > > All the Read Executor child classes require use of the StorageProxy to carry > out reads. We can pass the StorageProxy along in the constructor. -- This message was sent by Atlassian JIRA (v6.2#6252)
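Yuki's suggestion, passing the executor only the data it needs rather than the StorageProxy singleton, might be sketched like this. The types are hypothetical simplifications (the real code deals in InetAddress endpoints and a richer executor hierarchy):

```java
// Hypothetical sketch: constructor-inject the live sorted endpoints instead of
// a StorageProxy, so the executor has no dependency on the singleton.
import java.util.List;

abstract class AbstractReadExecutor {
    protected final List<String> liveSortedEndpoints;

    protected AbstractReadExecutor(List<String> liveSortedEndpoints) {
        this.liveSortedEndpoints = liveSortedEndpoints;
    }

    /** The closest replica is simply the first of the pre-sorted endpoints. */
    String closestReplica() {
        return liveSortedEndpoints.get(0);
    }
}

class NeverSpeculatingReadExecutor extends AbstractReadExecutor {
    NeverSpeculatingReadExecutor(List<String> endpoints) {
        super(endpoints);
    }
}
```

The caller (getExecutor in the real code) would compute the sorted endpoint list once and hand it in, which also makes the executors trivially unit-testable.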
[jira] [Commented] (CASSANDRA-7028) Allow C* to compile under java 8
[ https://issues.apache.org/jira/browse/CASSANDRA-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969765#comment-13969765 ] Joshua McKenzie commented on CASSANDRA-7028: --- Ah - good call on the runtime libraries. v4 lost the full index, so file deletion failed, and the diff has references to the new .jar files, which prevents it from applying either with the files or without. I've attached a v5 that cleans up some whitespace complaints and includes the binary content, both deletion and addition. We should be able to just apply this to trunk and get all the changes - one shot, no need to download libraries separately and place them for the committer. The diff syntax I used to build this was 'git diff --full-index --binary '. Even w/ --full-index, if you don't include the --binary flag it won't generate the data that goes with the new files you've added, and you end up with an invalid patch, as it has markers to add files but no binary data to place in them. I reran tests on linux against this just to confirm the changes to resolve HintedHandOffTest didn't munge anything else, and it all looks good on jdk7. I'm +1 on the v5 patch; give it a run against trunk and let me know how it works for you. > Allow C* to compile under java 8 > > > Key: CASSANDRA-7028 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7028 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Dave Brosius >Assignee: Dave Brosius >Priority: Minor > Fix For: 3.0 > > Attachments: 7028.txt, 7028_v2.txt, 7028_v3.txt, 7028_v4.txt, > 7028_v5.patch > > > antlr 3.2 has a problem with java 8, as described here: > http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8015656 > updating to antlr 3.5.2 solves this, however they have split up the jars > differently, which adds some changes, but also the generation of > CqlParser.java causes a method to be too large, so i needed to split that > method to reduce its size. 
> (patch against trunk) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7028) Allow C* to compile under java 8
[ https://issues.apache.org/jira/browse/CASSANDRA-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua McKenzie updated CASSANDRA-7028: --- Attachment: 7028_v5.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7040) Replace read/write stage with per-disk access coordination
[ https://issues.apache.org/jira/browse/CASSANDRA-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969753#comment-13969753 ] Jason Brown commented on CASSANDRA-7040: --- CASSANDRA-5863 could be legit, as well :). As to an intelligent "storage manager", I don't think that's necessarily blocked by this work, but I do agree it's a non-trivial undertaking. > Replace read/write stage with per-disk access coordination > -- > > Key: CASSANDRA-7040 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7040 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict > Labels: performance > Fix For: 3.0 > > > As discussed in CASSANDRA-6995, current coordination of access to disk is > suboptimal: instead of ensuring disk accesses alone are coordinated, we > instead coordinate at the level of operations that may touch the disks, > ensuring only so many are proceeding at once. As such, tuning is difficult, > and we incur unnecessary delays for operations that would not touch the > disk(s). > Ideally we would instead simply use a shared coordination primitive to gate > access to the disk when we perform a rebuffer. This work would dovetail very > nicely with any work in CASSANDRA-5863, as we could prevent any blocking or > context switching for data that we know to be cached. It also, as far as I > can tell, obviates the need for CASSANDRA-5239. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7042) Disk space growth until restart
Zach Aller created CASSANDRA-7042: - Summary: Disk space growth until restart Key: CASSANDRA-7042 URL: https://issues.apache.org/jira/browse/CASSANDRA-7042 Project: Cassandra Issue Type: Bug Environment: Ubuntu 12.04 Sun Java 7 Cassandra 2.0.6 Reporter: Zach Aller Priority: Critical Cassandra will constantly eat disk space; we're not sure what's causing it, and the only thing that seems to fix it is a restart of Cassandra. This happens about every 3-5 hrs: we grow from about 350GB to 650GB with no end in sight. Once we restart Cassandra it usually all clears itself up and the disks return to normal for a while, then something triggers it and it starts climbing again. Sometimes when we restart, compactions pending skyrocket, and if we restart a second time the compactions pending drop back off to a normal level. One other thing to note: the space is not freed until Cassandra starts back up, not when it is shut down. I will get a clean log from before and after restarting next time it happens and post it. Here is a common ERROR in our logs that might be related: ERROR [CompactionExecutor:46] 2014-04-15 09:12:51,040 CassandraDaemon.java (line 196) Exception in thread Thread[CompactionExecutor:46,1,main] java.lang.RuntimeException: java.io.FileNotFoundException: /local-project/cassandra_data/data/wxgrid/grid/wxgrid-grid-jb-468677-Data.db (No such file or directory) at org.apache.cassandra.io.util.ThrottledReader.open(ThrottledReader.java:53) at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1355) at org.apache.cassandra.io.sstable.SSTableScanner.(SSTableScanner.java:67) at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1161) at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1173) at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getScanners(LeveledCompactionStrategy.java:194) at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:258) 
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:126) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:197) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.io.FileNotFoundException: /local-project/cassandra_data/data/wxgrid/grid/wxgrid-grid-jb-468677-Data.db (No such file or directory) at java.io.RandomAccessFile.open(Native Method) at java.io.RandomAccessFile.(Unknown Source) at org.apache.cassandra.io.util.RandomAccessReader.(RandomAccessReader.java:58) at org.apache.cassandra.io.util.ThrottledReader.(ThrottledReader.java:35) at org.apache.cassandra.io.util.ThrottledReader.open(ThrottledReader.java:49) ... 17 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7040) Replace read/write stage with per-disk access coordination
[ https://issues.apache.org/jira/browse/CASSANDRA-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969736#comment-13969736 ] Benedict commented on CASSANDRA-7040: --- bq. You could add in helpers like mincore (and row cache) to help inform you Or CASSANDRA-5863 :-) As to batching - that's another step further along: it would be interesting to experiment with an intelligent "storage manager" that requests are submitted to, and are coordinated by, but I think that comes after 5863 + this. There are lots of ways we might be able to get improved performance with that approach, but I'm not absolutely sure they'll pan out, and they'll be a non-trivial undertaking. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7040) Replace read/write stage with per-disk access coordination
[ https://issues.apache.org/jira/browse/CASSANDRA-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969726#comment-13969726 ] Jason Brown commented on CASSANDRA-7040: --- Martin Thompson mentions batching IO events in a talk at React Conf 2014: https://www.youtube.com/watch?v=4dfk3ucthN8 . The idea seems reasonable, but I haven't investigated it yet. bq. that may touch the disks Yeah, the key word here is *may*. You could add in helpers like mincore (and the row cache) to help inform you if you have nothing in memory and will be going to disk. -- This message was sent by Atlassian JIRA (v6.2#6252)
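The ticket's "shared coordination primitive to gate access to the disk when we perform a rebuffer" could be sketched, very roughly, as a per-disk semaphore that only the cache-miss path touches. The names here are hypothetical, not the actual design:

```java
// Hypothetical sketch: gate only the actual disk read behind a per-disk
// semaphore, so operations served from cache never block or context-switch.
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class DiskGate {
    private final Semaphore permits;

    public DiskGate(int concurrentReads) {
        // One gate per physical disk; permits bound concurrent disk accesses.
        this.permits = new Semaphore(concurrentReads);
    }

    /** Runs the supplied disk read under the gate; cached reads bypass it entirely. */
    public byte[] rebuffer(Supplier<byte[]> diskRead) {
        permits.acquireUninterruptibly();
        try {
            return diskRead.get();
        } finally {
            permits.release();
        }
    }
}
```

The point of gating at the rebuffer rather than at the read/write stage is exactly the one the description makes: a request that never misses the cache never waits on a permit.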
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969724#comment-13969724 ] Benedict commented on CASSANDRA-6949: - bq. Only until a compaction, which will also remove stale entries. Does it? I don't see how... > Performance regression in tombstone heavy workloads > --- > > Key: CASSANDRA-6949 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6949 > Project: Cassandra > Issue Type: Bug >Reporter: Jeremiah Jordan >Assignee: Sam Tunnicliffe > Attachments: > 0001-Remove-expansion-of-RangeTombstones-to-delete-from-2.patch, 6949.txt > > > CASSANDRA-5614 causes a huge performance regression in tombstone heavy > workloads. The isDeleted checks here cause a huge CPU overhead: > https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/db/AtomicSortedColumns.java#L189-L196 > An insert workload which does perfectly fine on 1.2, pegs CPU use at 100% on > 2.0, with all of the mutation threads sitting in that loop. 
For example: > {noformat} > "MutationStage:20" daemon prio=10 tid=0x7fb1c4c72800 nid=0x2249 runnable > [0x7fb1b033] >java.lang.Thread.State: RUNNABLE > at org.apache.cassandra.db.marshal.BytesType.bytesCompare(BytesType.java:45) > at org.apache.cassandra.db.marshal.UTF8Type.compare(UTF8Type.java:34) > at org.apache.cassandra.db.marshal.UTF8Type.compare(UTF8Type.java:26) > at > org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:267) > at > org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:85) > at > org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:35) > at > org.apache.cassandra.db.RangeTombstoneList.searchInternal(RangeTombstoneList.java:253) > at > org.apache.cassandra.db.RangeTombstoneList.isDeleted(RangeTombstoneList.java:210) > at org.apache.cassandra.db.DeletionInfo.isDeleted(DeletionInfo.java:136) > at org.apache.cassandra.db.DeletionInfo.isDeleted(DeletionInfo.java:123) > at > org.apache.cassandra.db.AtomicSortedColumns.addAllWithSizeDelta(AtomicSortedColumns.java:193) > at org.apache.cassandra.db.Memtable.resolve(Memtable.java:194) > at org.apache.cassandra.db.Memtable.put(Memtable.java:158) > at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:890) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:368) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:333) > at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:201) > at > org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56) > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969716#comment-13969716 ] Benedict commented on CASSANDRA-6949: --- It's worth pointing out that a sensible intersection implementation over two ordered sets can be quite efficient and a fairly low computational burden, which is possibly a good middle ground. But if there's no real risk to getting rid of it, that's probably best. -- This message was sent by Atlassian JIRA (v6.2#6252)
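The merge-style intersection Benedict refers to can be sketched as a single pass over two sorted lists: O(n + m) comparisons, rather than a search per element. This is illustrative only; the real structures involved are the RangeTombstoneList and the memtable's sorted cells, not integer lists:

```java
// Hypothetical sketch: intersect two sorted lists in one merge-style pass,
// advancing whichever cursor points at the smaller element.
import java.util.ArrayList;
import java.util.List;

public class OrderedIntersection {
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {         // present in both: emit and advance both
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {   // a is behind: advance a
                i++;
            } else {                // b is behind: advance b
                j++;
            }
        }
        return out;
    }
}
```

Compare this with the hot loop in the stack trace above, which performs a binary search over the tombstone list for every cell it visits.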
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969715#comment-13969715 ] Sam Tunnicliffe commented on CASSANDRA-6949: --- Only until a compaction, which will also remove stale entries. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969709#comment-13969709 ] Benedict commented on CASSANDRA-6949: --- I assume the only real risk with reverting is that if there are no reads we can get uncontrolled growth of the 2i? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-6949: --- Reviewer: Jonathan Ellis (was: Sam Tunnicliffe) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969692#comment-13969692 ] Sam Tunnicliffe commented on CASSANDRA-6949: That will help in the simple case where there are no indexes defined for the table, but it won't make a difference if there are. In other words, if the table has any indexes defined (including PerRowSecondaryIndexes, for which the specifics of the update are meaningless), we'll still iterate over every cell in that partition in the memtable to check it's not covered by the range tombstone. Personally, I'd prefer to revert the change to AtomicSortedColumns from CASSANDRA-5614 completely. It isn't necessary to ensure correctness in either KeysIndex or CompositesIndex as the repair-on-read behaviour cleans up any stale index entries (as does compaction). Given that, it doesn't seem worth the performance hit to ensure the 2i is kept absolutely in sync like this. Attaching a patch against 2.0 to remove the ASC changes from 5614. > Performance regression in tombstone heavy workloads > --- > > Key: CASSANDRA-6949 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6949 > Project: Cassandra > Issue Type: Bug >Reporter: Jeremiah Jordan >Assignee: Jeremiah Jordan > Attachments: 6949.txt > > > CASSANDRA-5614 causes a huge performance regression in tombstone heavy > workloads. The isDeleted checks here cause a huge CPU overhead: > https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/db/AtomicSortedColumns.java#L189-L196 > An insert workload which does perfectly fine on 1.2, pegs CPU use at 100% on > 2.0, with all of the mutation threads sitting in that loop. 
For example: > {noformat}
> "MutationStage:20" daemon prio=10 tid=0x7fb1c4c72800 nid=0x2249 runnable [0x7fb1b033]
>    java.lang.Thread.State: RUNNABLE
> at org.apache.cassandra.db.marshal.BytesType.bytesCompare(BytesType.java:45)
> at org.apache.cassandra.db.marshal.UTF8Type.compare(UTF8Type.java:34)
> at org.apache.cassandra.db.marshal.UTF8Type.compare(UTF8Type.java:26)
> at org.apache.cassandra.db.marshal.AbstractType.compareCollectionMembers(AbstractType.java:267)
> at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:85)
> at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:35)
> at org.apache.cassandra.db.RangeTombstoneList.searchInternal(RangeTombstoneList.java:253)
> at org.apache.cassandra.db.RangeTombstoneList.isDeleted(RangeTombstoneList.java:210)
> at org.apache.cassandra.db.DeletionInfo.isDeleted(DeletionInfo.java:136)
> at org.apache.cassandra.db.DeletionInfo.isDeleted(DeletionInfo.java:123)
> at org.apache.cassandra.db.AtomicSortedColumns.addAllWithSizeDelta(AtomicSortedColumns.java:193)
> at org.apache.cassandra.db.Memtable.resolve(Memtable.java:194)
> at org.apache.cassandra.db.Memtable.put(Memtable.java:158)
> at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:890)
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:368)
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:333)
> at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:201)
> at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-6949: --- Attachment: 0001-Remove-expansion-of-RangeTombstones-to-delete-from-2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe reassigned CASSANDRA-6949: -- Assignee: Sam Tunnicliffe (was: Jeremiah Jordan) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969655#comment-13969655 ] Sergio Bossa commented on CASSANDRA-6949: - That's not enough: PRSIs don't get notified of column-level deletes (they don't need to be), so there would still be a performance regression in that case, even with that extra check. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-5220) Repair improvements when using vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969613#comment-13969613 ] Richard Low commented on CASSANDRA-5220: It's going to be a lot slower when there's little data because there is num_tokens times as much work to do. But when there is lots of data the times should be pretty much independent of num_tokens, because most of the repair time is spent reading data and hashing. I ran some tests when we were developing vnodes (sorry, I no longer have the data available) and this was the case. Something might have regressed though. > Repair improvements when using vnodes > - > > Key: CASSANDRA-5220 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5220 > Project: Cassandra > Issue Type: Improvement > Components: Core >Affects Versions: 1.2.0 beta 1 >Reporter: Brandon Williams >Assignee: Yuki Morishita > Fix For: 2.1 beta2 > > > Currently when using vnodes, repair takes much longer to complete than > without them. This appears at least in part because it's using a session per > range and processing them sequentially. This generates a lot of log spam > with vnodes, and while being gentler and lighter on hard disk deployments, > ssd-based deployments would often prefer that repair be as fast as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremiah Jordan updated CASSANDRA-6949: --- Attachment: 6949.txt Looks like we actually added that check in 2.1. I don't know if there is more we want to do, but is it valid to just check {noformat} if (indexer != SecondaryIndexManager.nullUpdater && cm.deletionInfo().hasRanges()) {noformat} instead of {noformat} if (cm.deletionInfo().hasRanges()) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
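The effect of the guard proposed above can be illustrated in isolation. Everything below is a simplified stand-in (no actual Cassandra types), showing only the short-circuit the patch relies on: when there is no index to maintain, the per-cell range-tombstone work is skipped entirely.

```java
// Illustrative sketch of the proposed guard: skip the per-cell
// range-tombstone isDeleted() checks entirely when no secondary index
// will observe the update. Types here are simplified stand-ins.
public class TombstoneGuardSketch {
    /** Sentinel meaning "no index to maintain", like SecondaryIndexManager.nullUpdater. */
    static final Object NULL_UPDATER = new Object();

    static boolean needsRangeTombstoneChecks(Object indexer, boolean mutationHasRanges) {
        // Proposed condition: only pay the isDeleted() cost when an index
        // actually needs to see cells covered by a range tombstone.
        return indexer != NULL_UPDATER && mutationHasRanges;
    }

    public static void main(String[] args) {
        Object realIndexer = new Object();
        System.out.println(needsRangeTombstoneChecks(NULL_UPDATER, true));  // false: no index, skip
        System.out.println(needsRangeTombstoneChecks(realIndexer, true));   // true: must check
        System.out.println(needsRangeTombstoneChecks(realIndexer, false));  // false: no ranges, skip
    }
}
```

With this condition, insert-only workloads on unindexed tables never enter the expensive loop, which is exactly the regression case reported here.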
[jira] [Commented] (CASSANDRA-6949) Performance regression in tombstone heavy workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969587#comment-13969587 ] Jeremiah Jordan commented on CASSANDRA-6949: This code doesn't seem to check whether there are actually indexes on the columns before doing all the range tombstone and isDeleted checks. If all those checks are really needed, can we at least only do them if there is actually a 2i of some sort on the table? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7041) Select query returns inconsistent result
Ngoc Minh Vo created CASSANDRA-7041: --- Summary: Select query returns inconsistent result Key: CASSANDRA-7041 URL: https://issues.apache.org/jira/browse/CASSANDRA-7041 Project: Cassandra Issue Type: Bug Components: Core Environment: Cassandra v2.0.6 (upgraded from v2.0.3) 4-node cluster: Windows7, 12GB JVM Reporter: Ngoc Minh Vo Priority: Critical Hello, We are running into an issue with C* v2.0.x: CQL queries randomly return empty results. Here is the scenario: 1. Schema:
{noformat}
CREATE TABLE string_values (
  date int,
  field text,
  value text,
  PRIMARY KEY ((date, field), value)
) WITH bloom_filter_fp_chance=0.10
  AND caching='KEYS_ONLY'
  AND comment=''
  AND dclocal_read_repair_chance=0.00
  AND gc_grace_seconds=864000
  AND index_interval=128
  AND read_repair_chance=0.10
  AND replicate_on_write='true'
  AND populate_io_cache_on_flush='false'
  AND default_time_to_live=0
  AND speculative_retry='99.0PERCENTILE'
  AND memtable_flush_period_in_ms=0
  AND compaction={'class': 'LeveledCompactionStrategy'}
  AND compression={'sstable_compression': 'LZ4Compressor'};
{noformat}
2. There is no new data imported to the cluster during the test. 3. CQL query: {noformat} select * from string_values where date=20140122 and field='SCONYKSP1'; {noformat} 4. In cqlsh, the same query was executed several times within a short interval (~1-2 seconds). The first queries return empty results, then we get the data; from that point on, we always get the correct result:
{noformat}
cqlsh:titan_test> select * from string_values where date=20140122 and field='SCONYKSP1';
(0 rows)
cqlsh:titan_test> select * from string_values where date=20140122 and field='SCONYKSP1';
(0 rows)
...
...
cqlsh:titan_test> select * from string_values where date=20140122 and field='SCONYKSP1';
(0 rows)
cqlsh:titan_test> select * from string_values where date=20140122 and field='SCONYKSP1';
(0 rows)
cqlsh:titan_test> select * from string_values where date=20140122 and field='SCONYKSP1';
 date     | field     | value
----------+-----------+-------------------------
 20140122 | SCONYKSP1 | 201401220251826297a_0_3
(1 rows)
cqlsh:titan_test> select * from string_values where date=20140122 and field='SCONYKSP1';
 date     | field     | value
----------+-----------+-------------------------
 20140122 | SCONYKSP1 | 201401220251826297a_0_3
(1 rows)
{noformat}
5. It might relate to some kind of "warmup" process. We tried to disable key/data caching but it does not help. Upgrading the cluster from v2.0.3 to v2.0.6 does not fix the issue (hence, it is not related to CASSANDRA-6555). Some time ago, we posted a report on the Java Driver JIRA: https://datastax-oss.atlassian.net/browse/JAVA-217. But it seems that the issue is on the server side. Best regards, Minh -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6802) Row cache improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6802: Labels: performance (was: ) > Row cache improvements > -- > > Key: CASSANDRA-6802 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6802 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Labels: performance > Fix For: 3.0 > > > There are a few things we could do; > * Start using the native memory constructs from CASSANDRA-6694 to avoid > serialization/deserialization costs and to minimize the on-heap overhead > * Stop invalidating cached rows on writes (update on write instead). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-5863) In process (uncompressed) page cache
[ https://issues.apache.org/jira/browse/CASSANDRA-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-5863: Summary: In process (uncompressed) page cache (was: Create a Decompressed Chunk [block] Cache) > In process (uncompressed) page cache > > > Key: CASSANDRA-5863 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5863 > Project: Cassandra > Issue Type: New Feature > Components: Core >Reporter: T Jake Luciani >Assignee: Pavel Yaskevich > Labels: performance > Fix For: 2.1 beta2 > > > Currently, for every read, the CRAR reads each compressed chunk into a > byte[], sends it to ICompressor, gets back another byte[] and verifies a > checksum. > This process is where the majority of time is spent in a read request. > Before compression, we would have zero-copy of data and could respond > directly from the page-cache. > It would be useful to have some kind of Chunk cache that could speed up this > process for hot data. Initially this could be a off heap cache but it would > be great to put these decompressed chunks onto a SSD so the hot data lives on > a fast disk similar to https://github.com/facebook/flashcache. -- This message was sent by Atlassian JIRA (v6.2#6252)
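As a rough illustration of the chunk-cache idea described above (all names here are hypothetical, and a bounded LRU via LinkedHashMap stands in for whatever eviction policy a real implementation would use): hot reads skip the read-decompress-checksum cycle entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a decompressed-chunk cache: memoize chunks by
// their on-disk offset so hot reads skip the byte[] -> ICompressor ->
// byte[] -> checksum path. Not the actual CRAR code.
public class ChunkCacheSketch {
    private final int maxChunks;
    private final Map<Long, byte[]> cache;
    int decompressions = 0; // counts cache misses, for illustration

    ChunkCacheSketch(int maxChunks) {
        this.maxChunks = maxChunks;
        // Access-ordered LinkedHashMap gives a simple bounded LRU.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
                return size() > ChunkCacheSketch.this.maxChunks;
            }
        };
    }

    byte[] chunk(long offset) {
        byte[] cached = cache.get(offset);
        if (cached != null)
            return cached;                  // hot path: no read, no decompression
        byte[] decompressed = readAndDecompress(offset);
        decompressions++;
        cache.put(offset, decompressed);
        return decompressed;
    }

    // Stand-in for the read + ICompressor.uncompress + checksum cycle.
    byte[] readAndDecompress(long offset) { return new byte[65536]; }

    public static void main(String[] args) {
        ChunkCacheSketch cache = new ChunkCacheSketch(128);
        cache.chunk(0); cache.chunk(0); cache.chunk(65536);
        System.out.println(cache.decompressions); // prints 2: two misses, one hit
    }
}
```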
[jira] [Comment Edited] (CASSANDRA-6487) Log WARN on large batch sizes
[ https://issues.apache.org/jira/browse/CASSANDRA-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969501#comment-13969501 ] Lyuben Todorov edited comment on CASSANDRA-6487 at 4/15/14 1:05 PM: Just noticed that we're actually already using the memory meter for checking batch size when it might get placed into the prepared statement cache, so why not log based on that value (calculated in {{BatchStatement#measureForPreparedCache}}). As for non-prepared batch statements, there we can enforce a limit based on count of statements. was (Author: lyubent): Just noticed that we're actually already using the memory meter for checking batch size when it might get placed into the prepared statement cache, so why not log based on that value (calculated in {{BatchStatement#measureForPreparedCache}}). > Log WARN on large batch sizes > - > > Key: CASSANDRA-6487 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6487 > Project: Cassandra > Issue Type: Improvement >Reporter: Patrick McFadin >Assignee: Lyuben Todorov >Priority: Minor > Fix For: 2.0.8 > > Attachments: 6487_trunk.patch, 6487_trunk_v2.patch, > cassandra-2.0-6487.diff > > > Large batches on a coordinator can cause a lot of node stress. I propose > adding a WARN log entry if batch sizes go beyond a configurable size. This > will give more visibility to operators on something that can happen on the > developer side. > New yaml setting with 5k default. > {{# Log WARN on any batch size exceeding this value. 5k by default.}} > {{# Caution should be taken on increasing the size of this threshold as it > can lead to node instability.}} > {{batch_size_warn_threshold: 5k}} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6995) Execute local ONE/LOCAL_ONE reads on request thread instead of dispatching to read stage
[ https://issues.apache.org/jira/browse/CASSANDRA-6995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969515#comment-13969515 ] Benedict commented on CASSANDRA-6995: - I've split my suggestion out into another ticket: CASSANDRA-7040 > Execute local ONE/LOCAL_ONE reads on request thread instead of dispatching to > read stage > > > Key: CASSANDRA-6995 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6995 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jason Brown >Assignee: Jason Brown >Priority: Minor > Labels: performance > Fix For: 2.0.7 > > Attachments: 6995-v1.diff, syncread-stress.txt > > > When performing a read local to a coordinator node, AbstractReadExecutor will > create a new SP.LocalReadRunnable and drop it into the read stage for > asynchronous execution. If you are using a client that intelligently routes > read requests to a node holding the data for a given request, and are using > CL.ONE/LOCAL_ONE, the enqueuing SP.LocalReadRunnable and waiting for the > context switches (and possible NUMA misses) adds unneccesary latency. We can > reduce that latency and improve throughput by avoiding the queueing and > thread context switching by simply executing the SP.LocalReadRunnable > synchronously in the request thread. Testing on a three node cluster (each > with 32 cpus, 132 GB ram) yields ~10% improvement in throughput and ~20% > speedup on avg/95/99 percentiles (99.9% was about 5-10% improvement). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969513#comment-13969513 ] Jason Brown commented on CASSANDRA-4718: OK, will give it a shot today. Also, just noticed I did not tune native_transport_max_threads at all (so I have the default of 128). Might play with that a bit, as well. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: performance > Fix For: 2.1 > > Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op > costs of various queues.ods, stress op rate with various queues.ods, > v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) > AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) 
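For reference, the bulk-dequeue pattern the ticket describes can be sketched with drainTo() directly (this is only an illustration of the JDK call, not the proposed executor):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the bulk-dequeue idea: a consumer drains a batch of
// tasks per queue operation instead of taking them one at a time,
// reducing producer/consumer contention on the queue.
public class BulkDequeueSketch {
    public static int drainAndRun(LinkedBlockingQueue<Runnable> queue, int maxBatch) {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        // One lock acquisition moves up to maxBatch tasks out of the queue.
        int drained = queue.drainTo(batch, maxBatch);
        for (Runnable task : batch)
            task.run();
        return drained;
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        final int[] executed = {0};
        for (int i = 0; i < 10; i++)
            queue.add(() -> executed[0]++);
        int drained = drainAndRun(queue, 8);
        System.out.println(drained + " tasks drained, " + executed[0] + " executed");
        // prints "8 tasks drained, 8 executed"; 2 tasks remain queued
    }
}
```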
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7040) Replace read/write stage with per-disk access coordination
[ https://issues.apache.org/jira/browse/CASSANDRA-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969514#comment-13969514 ] Benedict commented on CASSANDRA-7040: - Further, once we have this, we can experiment with periodically locking access to the disks (for short, say 20-50ms periods) in order to let compactions/flushes catch up with any outstanding work, if they appear to be getting behind. > Replace read/write stage with per-disk access coordination > -- > > Key: CASSANDRA-7040 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7040 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict > Labels: performance > Fix For: 3.0 > > > As discussed in CASSANDRA-6995, current coordination of access to disk is > suboptimal: instead of ensuring disk accesses alone are coordinated, we > instead coordinate at the level of operations that may touch the disks, > ensuring only so many are proceeding at once. As such, tuning is difficult, > and we incur unnecessary delays for operations that would not touch the > disk(s). > Ideally we would instead simply use a shared coordination primitive to gate > access to the disk when we perform a rebuffer. This work would dovetail very > nicely with any work in CASSANDRA-5863, as we could prevent any blocking or > context switching for data that we know to be cached. It also, as far as I > can tell, obviates the need for CASSANDRA-5239. -- This message was sent by Atlassian JIRA (v6.2#6252)
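A minimal sketch of the gating idea above: bound only the actual disk touch inside rebuffer with a shared per-disk primitive, so cache-served reads never block. The class name, the Semaphore choice, and the permit count are illustrative assumptions, not from any patch.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of per-disk access coordination at the rebuffer
// level rather than the whole-operation level: a per-disk Semaphore
// bounds only the threads actually touching the disk.
public class PerDiskGateSketch {
    private final Semaphore diskPermits;

    PerDiskGateSketch(int concurrentAccesses) {
        this.diskPermits = new Semaphore(concurrentAccesses);
    }

    byte[] rebuffer(long position, boolean cached) {
        if (cached)
            return readFromCache(position);      // no coordination for cached data
        diskPermits.acquireUninterruptibly();    // gate only the actual disk touch
        try {
            return readFromDisk(position);
        } finally {
            diskPermits.release();
        }
    }

    // Stand-ins for the real buffer-fill paths.
    byte[] readFromCache(long position) { return new byte[64]; }
    byte[] readFromDisk(long position)  { return new byte[64]; }

    public static void main(String[] args) {
        PerDiskGateSketch gate = new PerDiskGateSketch(2);
        System.out.println(gate.rebuffer(0L, true).length);   // cached: never blocks
        System.out.println(gate.rebuffer(0L, false).length);  // disk: gated by permits
    }
}
```

Pausing the gate periodically (as suggested in the comment) would amount to draining the permits for a short window so compactions/flushes can catch up.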
[jira] [Created] (CASSANDRA-7040) Replace read/write stage with per-disk access coordination
Benedict created CASSANDRA-7040: --- Summary: Replace read/write stage with per-disk access coordination Key: CASSANDRA-7040 URL: https://issues.apache.org/jira/browse/CASSANDRA-7040 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Fix For: 3.0 As discussed in CASSANDRA-6995, current coordination of access to disk is suboptimal: instead of ensuring disk accesses alone are coordinated, we instead coordinate at the level of operations that may touch the disks, ensuring only so many are proceeding at once. As such, tuning is difficult, and we incur unnecessary delays for operations that would not touch the disk(s). Ideally we would instead simply use a shared coordination primitive to gate access to the disk when we perform a rebuffer. This work would dovetail very nicely with any work in CASSANDRA-5863, as we could prevent any blocking or context switching for data that we know to be cached. It also, as far as I can tell, obviates the need for CASSANDRA-5239. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6487) Log WARN on large batch sizes
[ https://issues.apache.org/jira/browse/CASSANDRA-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969501#comment-13969501 ] Lyuben Todorov commented on CASSANDRA-6487: --- Just noticed that we're actually already using the memory meter for checking batch size when it might get placed into the prepared statement cache, so why not log based on that value (calculated in {{BatchStatement#measureForPreparedCache}}). > Log WARN on large batch sizes > - > > Key: CASSANDRA-6487 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6487 > Project: Cassandra > Issue Type: Improvement >Reporter: Patrick McFadin >Assignee: Lyuben Todorov >Priority: Minor > Fix For: 2.0.8 > > Attachments: 6487_trunk.patch, 6487_trunk_v2.patch, > cassandra-2.0-6487.diff > > > Large batches on a coordinator can cause a lot of node stress. I propose > adding a WARN log entry if batch sizes go beyond a configurable size. This > will give more visibility to operators on something that can happen on the > developer side. > New yaml setting with 5k default. > {{# Log WARN on any batch size exceeding this value. 5k by default.}} > {{# Caution should be taken on increasing the size of this threshold as it > can lead to node instability.}} > {{batch_size_warn_threshold: 5k}} -- This message was sent by Atlassian JIRA (v6.2#6252)
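The proposal above boils down to a threshold check against a measured batch size. A minimal sketch, with the 5k default taken from the ticket and all class/method names hypothetical:

```java
// Sketch of the proposed check (names hypothetical): log WARN once the
// measured size of a batch exceeds a configurable threshold.
class BatchSizeWarning
{
    // batch_size_warn_threshold: 5k by default, per the proposed yaml setting
    static final long BATCH_SIZE_WARN_THRESHOLD = 5 * 1024;

    // In the real patch the size would come from the same memory meter used
    // by BatchStatement#measureForPreparedCache, as the comment suggests.
    static boolean shouldWarn(long measuredBatchSize)
    {
        return measuredBatchSize > BATCH_SIZE_WARN_THRESHOLD;
    }
}
```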
[jira] [Commented] (CASSANDRA-6755) Optimise CellName/Composite comparisons for NativeCell
[ https://issues.apache.org/jira/browse/CASSANDRA-6755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969500#comment-13969500 ] Benedict commented on CASSANDRA-6755: - An ideal solution would probably be modelled on the util.FastByteOperations class > Optimise CellName/Composite comparisons for NativeCell > -- > > Key: CASSANDRA-6755 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6755 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 3.0 > > > As discussed in CASSANDRA-6694, to reduce temporary garbage generation we > should minimise the incidence of CellName component extraction. The biggest > win will be to perform comparisons on Cell where possible, instead of > CellName, so that Native*Cell can use its extra information to avoid creating > any ByteBuffer objects -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6755) Optimise CellName/Composite comparisons for NativeCell
[ https://issues.apache.org/jira/browse/CASSANDRA-6755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6755: Summary: Optimise CellName/Composite comparisons for NativeCell (was: Minimise extraction of CellName components) > Optimise CellName/Composite comparisons for NativeCell > -- > > Key: CASSANDRA-6755 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6755 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 3.0 > > > As discussed in CASSANDRA-6694, to reduce temporary garbage generation we > should minimise the incidence of CellName component extraction. The biggest > win will be to perform comparisons on Cell where possible, instead of > CellName, so that Native*Cell can use its extra information to avoid creating > any ByteBuffer objects -- This message was sent by Atlassian JIRA (v6.2#6252)
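For reference, the kind of comparison util.FastByteOperations performs is a lexicographic unsigned-byte compare. A simplified sketch over plain arrays follows; a Native*Cell version would instead read the bytes directly from native memory (e.g. via Unsafe) without materialising any ByteBuffer:

```java
// Simplified lexicographic unsigned-byte comparison, the operation a
// NativeCell-aware comparator would perform directly over native memory.
class ByteCompare
{
    static int compareUnsigned(byte[] a, byte[] b)
    {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++)
        {
            // Mask to treat each byte as unsigned (0..255) before comparing.
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0)
                return cmp;
        }
        // Shorter sequence sorts first when one is a prefix of the other.
        return a.length - b.length;
    }
}
```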
[jira] [Updated] (CASSANDRA-7039) DirectByteBuffer compatible LZ4 methods
[ https://issues.apache.org/jira/browse/CASSANDRA-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7039: Fix Version/s: 3.0 > DirectByteBuffer compatible LZ4 methods > --- > > Key: CASSANDRA-7039 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7039 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Priority: Minor > Labels: performance > Fix For: 3.0 > > > As we move more things off-heap, it's becoming more and more essential to be > able to use DirectByteBuffer (or native pointers) in various places. > Unfortunately LZ4 doesn't currently support this operation, despite being JNI > based - this means we not only have to perform unnecessary copies to de/compress > data from DBB, but can also stall GC, as any JNI method operating over a > java array using GetPrimitiveArrayCritical enters a critical section that > prevents GC for its duration. This means STWs will be at least as long as any > running compression/decompression (and no GC will happen until they complete, > so it's additive). > We should temporarily fork (and then resubmit upstream) jpountz-lz4 to > support operating over a native pointer, so that we can pass a DBB or a raw > pointer we have allocated ourselves. This will help improve performance when > flushing the new offheap memtables, as well as enable us to implement > CASSANDRA-6726 and finish CASSANDRA-4338. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (CASSANDRA-5020) Time to switch back to byte[] internally?
[ https://issues.apache.org/jira/browse/CASSANDRA-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-5020. - Resolution: Not a Problem This has most likely become "not a problem" as a result of movement towards off-heap memtables + cells, which bring the overheads down as low as we can go with a per-cell data structure. > Time to switch back to byte[] internally? > - > > Key: CASSANDRA-5020 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5020 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jonathan Ellis >Assignee: T Jake Luciani > Labels: performance > Fix For: 3.0 > > > We switched to ByteBuffer for column names and values back in 0.7, which gave > us a short term performance boost on mmap'd reads, but we gave that up when > we switched to refcounted sstables in 1.0. (refcounting all the way up the > read path would be too painful, so we copy into an on-heap buffer when > reading from an sstable, then release the reference.) > A HeapByteBuffer wastes a lot of memory compared to a byte[] (5 more ints, a > long, and a boolean). > The hard problem here is how to do the arena allocation we do on writes, > which has been very successful in reducing STW CMS from heap fragmentation. > ByteBuffer is a good fit there. -- This message was sent by Atlassian JIRA (v6.2#6252)
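The overhead claim in the ticket ("5 more ints, a long, and a boolean") can be tallied with back-of-envelope arithmetic. The field sizes below assume a 64-bit JVM with compressed oops; exact numbers vary by JVM and ignore alignment padding:

```java
// Rough per-value overhead a HeapByteBuffer adds on top of a bare byte[],
// under the assumptions stated above (illustrative, not authoritative).
class BufferOverhead
{
    static long extraHeapByteBufferBytes()
    {
        long objectHeader = 16;   // the wrapper object's own header (assumption)
        long arrayRef     = 4;    // reference to the backing byte[] (compressed oops)
        long fiveInts     = 5 * 4; // the "5 more ints" cited in the ticket
        long oneLong      = 8;    // the extra long field
        long oneBoolean   = 1;    // the extra boolean, before padding
        return objectHeader + arrayRef + fiveInts + oneLong + oneBoolean;
    }
}
```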
[jira] [Created] (CASSANDRA-7039) DirectByteBuffer compatible LZ4 methods
Benedict created CASSANDRA-7039: --- Summary: DirectByteBuffer compatible LZ4 methods Key: CASSANDRA-7039 URL: https://issues.apache.org/jira/browse/CASSANDRA-7039 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Priority: Minor As we move more things off-heap, it's becoming more and more essential to be able to use DirectByteBuffer (or native pointers) in various places. Unfortunately LZ4 doesn't currently support this operation, despite being JNI based - this means we not only have to perform unnecessary copies to de/compress data from DBB, but can also stall GC, as any JNI method operating over a java array using GetPrimitiveArrayCritical enters a critical section that prevents GC for its duration. This means STWs will be at least as long as any running compression/decompression (and no GC will happen until they complete, so it's additive). We should temporarily fork (and then resubmit upstream) jpountz-lz4 to support operating over a native pointer, so that we can pass a DBB or a raw pointer we have allocated ourselves. This will help improve performance when flushing the new offheap memtables, as well as enable us to implement CASSANDRA-6726 and finish CASSANDRA-4338. -- This message was sent by Atlassian JIRA (v6.2#6252)
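The "unnecessary copies" the ticket describes look roughly like this today: data sitting in a DirectByteBuffer must first be duplicated into a heap byte[] before a byte[]-only (de)compression API can see it. An illustrative sketch, not the actual LZ4 call path:

```java
import java.nio.ByteBuffer;

// The extra copy a byte[]-only compression API forces on callers holding
// data in a DirectByteBuffer; eliminating it is the point of the ticket.
class DirectBufferCopy
{
    static byte[] copyToHeap(ByteBuffer direct)
    {
        ByteBuffer dup = direct.duplicate(); // don't disturb the caller's position
        byte[] heap = new byte[dup.remaining()];
        dup.get(heap);
        return heap; // this allocation + copy is pure overhead
    }
}
```

A native-pointer-aware API would instead read straight from the buffer's address, avoiding both the copy and the GetPrimitiveArrayCritical critical section over a heap array.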
[1/2] git commit: Clean up IndexInfo on keyspace/table drops
Repository: cassandra Updated Branches: refs/heads/trunk 6e97178a5 -> fc4ae115a Clean up IndexInfo on keyspace/table drops patch by Sam Tunnicliffe; reviewed by Aleksey Yeschenko for CASSANDRA-6924 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/b69f5e36 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/b69f5e36 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/b69f5e36 Branch: refs/heads/trunk Commit: b69f5e363b75543429a25b0909b45dff735c64b2 Parents: 6658a6e Author: beobal Authored: Mon Apr 14 20:08:31 2014 +0100 Committer: Aleksey Yeschenko Committed: Tue Apr 15 15:17:58 2014 +0300 -- CHANGES.txt | 1 + src/java/org/apache/cassandra/config/CFMetaData.java | 6 ++ src/java/org/apache/cassandra/config/KSMetaData.java | 1 + 3 files changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/b69f5e36/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index d7c6e71..592eef9 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -44,6 +44,7 @@ * Ensure safe resource cleanup when replacing sstables (CASSANDRA-6912) * Add failure handler to async callback (CASSANDRA-6747) * Fix AE when closing SSTable without releasing reference (CASSANDRA-7000) + * Clean up IndexInfo on keyspace/table drops (CASSANDRA-6924) Merged from 2.0: * Put nodes in hibernate when join_ring is false (CASSANDRA-6961) * Allow compaction of system tables during startup (CASSANDRA-6913) http://git-wip-us.apache.org/repos/asf/cassandra/blob/b69f5e36/src/java/org/apache/cassandra/config/CFMetaData.java -- diff --git a/src/java/org/apache/cassandra/config/CFMetaData.java b/src/java/org/apache/cassandra/config/CFMetaData.java index e930de4..72a0fc5 100644 --- a/src/java/org/apache/cassandra/config/CFMetaData.java +++ b/src/java/org/apache/cassandra/config/CFMetaData.java @@ -1585,6 +1585,12 @@ public final class CFMetaData for (TriggerDefinition td : triggers.values()) 
 td.deleteFromSchema(mutation, cfName, timestamp);
+for (String indexName : Keyspace.open(this.ksName).getColumnFamilyStore(this.cfName).getBuiltIndexes())
+{
+    ColumnFamily indexCf = mutation.addOrGet(IndexCf);
+    indexCf.addTombstone(indexCf.getComparator().makeCellName(indexName), ldt, timestamp);
+}
+
 return mutation;
 }

http://git-wip-us.apache.org/repos/asf/cassandra/blob/b69f5e36/src/java/org/apache/cassandra/config/KSMetaData.java
--
diff --git a/src/java/org/apache/cassandra/config/KSMetaData.java b/src/java/org/apache/cassandra/config/KSMetaData.java
index 3d1edb6..d0cb613 100644
--- a/src/java/org/apache/cassandra/config/KSMetaData.java
+++ b/src/java/org/apache/cassandra/config/KSMetaData.java
@@ -242,6 +242,7 @@ public final class KSMetaData
 mutation.delete(SystemKeyspace.SCHEMA_COLUMNS_CF, timestamp);
 mutation.delete(SystemKeyspace.SCHEMA_TRIGGERS_CF, timestamp);
 mutation.delete(SystemKeyspace.SCHEMA_USER_TYPES_CF, timestamp);
+mutation.delete(SystemKeyspace.INDEX_CF, timestamp);
 return mutation;
 }
[2/2] git commit: Merge branch 'cassandra-2.1' into trunk
Merge branch 'cassandra-2.1' into trunk Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/fc4ae115 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/fc4ae115 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/fc4ae115 Branch: refs/heads/trunk Commit: fc4ae115ac94b1599d308956590672eaca49e64d Parents: 6e97178 b69f5e3 Author: Aleksey Yeschenko Authored: Tue Apr 15 15:23:12 2014 +0300 Committer: Aleksey Yeschenko Committed: Tue Apr 15 15:23:12 2014 +0300 -- CHANGES.txt | 1 + src/java/org/apache/cassandra/config/CFMetaData.java | 6 ++ src/java/org/apache/cassandra/config/KSMetaData.java | 1 + 3 files changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/fc4ae115/CHANGES.txt -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/fc4ae115/src/java/org/apache/cassandra/config/CFMetaData.java --
git commit: Clean up IndexInfo on keyspace/table drops
Repository: cassandra Updated Branches: refs/heads/cassandra-2.1 6658a6e03 -> b69f5e363 Clean up IndexInfo on keyspace/table drops patch by Sam Tunnicliffe; reviewed by Aleksey Yeschenko for CASSANDRA-6924 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/b69f5e36 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/b69f5e36 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/b69f5e36 Branch: refs/heads/cassandra-2.1 Commit: b69f5e363b75543429a25b0909b45dff735c64b2 Parents: 6658a6e Author: beobal Authored: Mon Apr 14 20:08:31 2014 +0100 Committer: Aleksey Yeschenko Committed: Tue Apr 15 15:17:58 2014 +0300 -- CHANGES.txt | 1 + src/java/org/apache/cassandra/config/CFMetaData.java | 6 ++ src/java/org/apache/cassandra/config/KSMetaData.java | 1 + 3 files changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/b69f5e36/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index d7c6e71..592eef9 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -44,6 +44,7 @@ * Ensure safe resource cleanup when replacing sstables (CASSANDRA-6912) * Add failure handler to async callback (CASSANDRA-6747) * Fix AE when closing SSTable without releasing reference (CASSANDRA-7000) + * Clean up IndexInfo on keyspace/table drops (CASSANDRA-6924) Merged from 2.0: * Put nodes in hibernate when join_ring is false (CASSANDRA-6961) * Allow compaction of system tables during startup (CASSANDRA-6913) http://git-wip-us.apache.org/repos/asf/cassandra/blob/b69f5e36/src/java/org/apache/cassandra/config/CFMetaData.java -- diff --git a/src/java/org/apache/cassandra/config/CFMetaData.java b/src/java/org/apache/cassandra/config/CFMetaData.java index e930de4..72a0fc5 100644 --- a/src/java/org/apache/cassandra/config/CFMetaData.java +++ b/src/java/org/apache/cassandra/config/CFMetaData.java @@ -1585,6 +1585,12 @@ public final class CFMetaData for (TriggerDefinition td : triggers.values()) 
 td.deleteFromSchema(mutation, cfName, timestamp);
+for (String indexName : Keyspace.open(this.ksName).getColumnFamilyStore(this.cfName).getBuiltIndexes())
+{
+    ColumnFamily indexCf = mutation.addOrGet(IndexCf);
+    indexCf.addTombstone(indexCf.getComparator().makeCellName(indexName), ldt, timestamp);
+}
+
 return mutation;
 }

http://git-wip-us.apache.org/repos/asf/cassandra/blob/b69f5e36/src/java/org/apache/cassandra/config/KSMetaData.java
--
diff --git a/src/java/org/apache/cassandra/config/KSMetaData.java b/src/java/org/apache/cassandra/config/KSMetaData.java
index 3d1edb6..d0cb613 100644
--- a/src/java/org/apache/cassandra/config/KSMetaData.java
+++ b/src/java/org/apache/cassandra/config/KSMetaData.java
@@ -242,6 +242,7 @@ public final class KSMetaData
 mutation.delete(SystemKeyspace.SCHEMA_COLUMNS_CF, timestamp);
 mutation.delete(SystemKeyspace.SCHEMA_TRIGGERS_CF, timestamp);
 mutation.delete(SystemKeyspace.SCHEMA_USER_TYPES_CF, timestamp);
+mutation.delete(SystemKeyspace.INDEX_CF, timestamp);
 return mutation;
 }
[jira] [Commented] (CASSANDRA-4718) More-efficient ExecutorService for improved throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969476#comment-13969476 ] Benedict commented on CASSANDRA-4718: - [~jasobrown]: Could you upload the full stress outputs for these runs? And also try running a separate stress run with a fixed high threadcount and op count? In particular for CQL, the results in the file are a little bit weird. That said, given their consistency for thrift I don't doubt the result is meaningful, but it would be good to understand what we're incorporating a bit better before committing. > More-efficient ExecutorService for improved throughput > -- > > Key: CASSANDRA-4718 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4718 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: performance > Fix For: 2.1 > > Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op > costs of various queues.ods, stress op rate with various queues.ods, > v1-stress.out > > > Currently all our execution stages dequeue tasks one at a time. This can > result in contention between producers and consumers (although we do our best > to minimize this by using LinkedBlockingQueue). > One approach to mitigating this would be to make consumer threads do more > work in "bulk" instead of just one task per dequeue. (Producer threads tend > to be single-task oriented by nature, so I don't see an equivalent > opportunity there.) > BlockingQueue has a drainTo(collection, int) method that would be perfect for > this. However, no ExecutorService in the jdk supports using drainTo, nor > could I google one. > What I would like to do here is create just such a beast and wire it into (at > least) the write and read stages. (Other possible candidates for such an > optimization, such as the CommitLog and OutboundTCPConnection, are not > ExecutorService-based and will need to be one-offs.) 
> AbstractExecutorService may be useful. The implementations of > ICommitLogExecutorService may also be useful. (Despite the name these are not > actual ExecutorServices, although they share the most important properties of > one.) -- This message was sent by Atlassian JIRA (v6.2#6252)
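The drainTo-based consumer loop proposed above can be sketched as follows; this is a minimal illustration of the batching idea, not the eventual Cassandra implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// One iteration of a batching consumer: block for at least one task, then
// drain a bounded batch of further ready tasks in a single dequeue, so the
// per-task producer/consumer contention is amortised across the batch.
class DrainingConsumer
{
    static int runOnce(BlockingQueue<Runnable> queue, int maxBatch)
    {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        Runnable first;
        try
        {
            first = queue.take(); // block until at least one task arrives
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return 0;
        }
        batch.add(first);
        queue.drainTo(batch, maxBatch - 1); // grab whatever else is ready, non-blocking
        for (Runnable task : batch)
            task.run();
        return batch.size();
    }
}
```

A real ExecutorService built on this loop would run it repeatedly on each worker thread; the point is that only the initial take() contends with producers, while drainTo moves any backlog in one operation.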
[jira] [Updated] (CASSANDRA-6924) Data Inserted Immediately After Secondary Index Creation is not Indexed
[ https://issues.apache.org/jira/browse/CASSANDRA-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-6924: --- Attachment: 6924-2.1.txt > Data Inserted Immediately After Secondary Index Creation is not Indexed > --- > > Key: CASSANDRA-6924 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6924 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Tyler Hobbs >Assignee: Sam Tunnicliffe > Fix For: 2.0.7 > > Attachments: 6924-2.1.txt, repro.py > > > The head of the cassandra-1.2 branch (currently 1.2.16-tentative) contains a > regression from 1.2.15. Data that is inserted immediately after secondary > index creation may never get indexed. > You can reproduce the issue with a [pycassa integration > test|https://github.com/pycassa/pycassa/blob/master/tests/test_autopacking.py#L793] > by running: > {noformat} > nosetests tests/test_autopacking.py:TestKeyValidators.test_get_indexed_slices > {noformat} > from the pycassa directory. > The operation order goes like this: > # create CF > # create secondary index > # insert data > # query secondary index > If a short sleep is added in between steps 2 and 3, the data gets indexed and > the query is successful. > If a sleep is only added in between steps 3 and 4, some of the data is never > indexed and the query will return incomplete results. This appears to be the > case even if the sleep is relatively long (30s), which makes me think the > data may never get indexed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6924) Data Inserted Immediately After Secondary Index Creation is not Indexed
[ https://issues.apache.org/jira/browse/CASSANDRA-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969428#comment-13969428 ] Sam Tunnicliffe commented on CASSANDRA-6924: This doesn't seem like a regression as the repro script fails for me just as consistently on 1.2.15 as it does on later versions. The issue appears to be that when a ks or cf is dropped, we don't update system.IndexInfo to remove the entry for the 2i. Then when the ks/cf & index are recreated, we treat the index creation not as a brand new index, but as if we're restarting and linking in an existing index to the cf. So we skip the buildIndexAsync call that we should make, which is what causes some entries to never get indexed. Fixing this so that we do clean up IndexInfo leads to us running into CASSANDRA-5202 on pre-2.1 branches. On 2.1, we see the issues mentioned in CASSANDRA-6959, so as Sylvain suggests there, the test needs to be changed to wait for schema agreement. This can be achieved with a 1s wait, or by actively testing for agreement. Now that the buildIndexAsync call is happening on index initialisation, we can insert this wait in one of two places: between the index creation and the inserts, or between the inserts and the reads. I've updated the dtest accordingly and added another variant which drops just the cf, rather than the entire ks (https://github.com/riptano/cassandra-dtest/pull/40). I do still see the errors from {{CommitLogSegmentManager}} on 2.1 detailed on CASSANDRA-6959 even after applying the patch attached to that issue. Likewise, using Tyler's original repro script, a 1s sleep before commencing the reads is now enough to ensure the run succeeds (on the 2.1 branch). On trunk, I get completely different errors running both the dtest & repro.py, both with and without the IndexInfo fix:
{code}
ERROR [Thrift:1] 2014-04-14 15:45:10,714 CustomTThreadPoolServer.java:212 - Error occurred during processing of message.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: fromIndex(34) > toIndex(25)
	at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:411) ~[main/:na]
	at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:281) ~[main/:na]
	at org.apache.cassandra.service.MigrationManager.announceColumnFamilyUpdate(MigrationManager.java:242) ~[main/:na]
	at org.apache.cassandra.cql3.statements.CreateIndexStatement.announceMigration(CreateIndexStatement.java:141) ~[main/:na]
	at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:71) ~[main/:na]
	at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:180) ~[main/:na]
	at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:214) ~[main/:na]
	at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[main/:na]
	at org.apache.cassandra.thrift.CassandraServer.execute_cql3_query(CassandraServer.java:1973) ~[main/:na]
	at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4486) ~[thrift/:na]
	at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4470) ~[thrift/:na]
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1]
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1]
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194) ~[main/:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: fromIndex(34) > toIndex(25)
	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.7.0_51]
	at java.util.concurrent.FutureTask.get(FutureTask.java:188) ~[na:1.7.0_51]
	at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:407) ~[main/:na]
	... 16 common frames omitted
Caused by: java.lang.IllegalArgumentException: fromIndex(34) > toIndex(25)
	at java.util.TimSort.rangeCheck(TimSort.java:921) ~[na:1.7.0_51]
	at java.util.TimSort.sort(TimSort.java:182) ~[na:1.7.0_51]
	at java.util.Arrays.sort(Arrays.java:727) ~[na:1.7.0_51]
	at org.apache.cassandra.db.ArrayBackedSortedColumns.sortCells(ArrayBackedSortedColumns.java:113) ~[main/:na]
	at org.apache.cassandra.db.ArrayBackedSortedColumns.maybeSortCells(ArrayBackedSortedColumns.j
[jira] [Commented] (CASSANDRA-7030) Remove JEMallocAllocator
[ https://issues.apache.org/jira/browse/CASSANDRA-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969419#comment-13969419 ] Benedict commented on CASSANDRA-7030: -
bq. This leads to the CLHM not obeying its limits as readily as it is asked to
Confirmed that the problem I am seeing with concurrent execution (and that I would guess is leading to your test results) is down to CLHM. By replacing the CLHM with an AtomicReferenceArray to guarantee the bounds I get:
{noformat}
concurrent malloc:     Total Elapsed: 9.708s    Allocate Elapsed: 21.271s   Free Elapsed: 26.023s    Total Allocated: 62483Mb  Rate: 1.290Gb/s  Live Allocated: 1020Mb  VM total:117  vsz: 3149  rsz: 1280
synchronized malloc:   Total Elapsed: 36.526s   Allocate Elapsed: 134.114s  Free Elapsed: 128.416s   Total Allocated: 62483Mb  Rate: 0.232Gb/s  Live Allocated: 1020Mb  VM total:117  vsz: 3213  rsz: 1427
synchronized jemalloc: Total Elapsed: 217.113s  Allocate Elapsed: 162.753s  Free Elapsed: 1531.215s  Total Allocated: 62483Mb  Rate: 0.036Gb/s  Live Allocated: 1020Mb  VM total:70   vsz: 4084  rsz: 1410
{noformat}
Can you rerun your test with either synchronised malloc, or with an AtomicReferenceArray instead of the CLHM, to confirm? Note I have reverted my position back to "let's get rid of jemalloc" - without more evidence to the contrary: the test I was running that initiated the creation of this ticket was measuring elapsed time for both allocate() *and* free(), and I dropped the latter from the tests based on your benchmark because it's difficult to time the free() calls (as they live in the eviction listener). Now I am timing both, and you can see the real-elapsed time and per-CPU elapsed times are dramatically higher for jemalloc once both are included. The cost of calling free() appears to be disproportionately higher for jemalloc. Note the throughput rate for jemalloc: 36Mb/s. This is really really pathetic!
> Remove JEMallocAllocator > > > Key: CASSANDRA-7030 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7030 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Labels: performance > Fix For: 2.1 beta2 > > Attachments: 7030.txt > > > JEMalloc, whilst having some nice performance properties by comparison to > Doug Lea's standard malloc algorithm in principle, is pointless in practice > because of the JNA cost. In general it is around 30x more expensive to call > than unsafe.allocate(); malloc does not have a variability of response time > as extreme as the JNA overhead, so using JEMalloc in Cassandra is never a > sensible idea. I doubt if custom JNI would make it worthwhile either. > I propose removing it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-3680) Add Support for Composite Secondary Indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-3680: --- Attachment: (was: 7038-1.2.txt) > Add Support for Composite Secondary Indexes > --- > > Key: CASSANDRA-3680 > URL: https://issues.apache.org/jira/browse/CASSANDRA-3680 > Project: Cassandra > Issue Type: Sub-task >Reporter: T Jake Luciani >Assignee: Sylvain Lebresne > Labels: cql3, secondary_index > Fix For: 1.2.0 beta 1 > > Attachments: 0001-Secondary-indexes-on-composite-columns.txt > > > CASSANDRA-2474 and CASSANDRA-3647 add the ability to transpose wide rows > differently, for efficiency and functionality secondary index api needs to be > altered to allow composite indexes. > I think this will require the IndexManager api to have a > maybeIndex(ByteBuffer column) method that SS can call and implement a > PerRowSecondaryIndex per column, break the composite into parts and index > specific bits, also including the base rowkey. > Then a search against a TRANSPOSED row or DOCUMENT will be possible. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-3680) Add Support for Composite Secondary Indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-3680: --- Attachment: (was: 7038-2.1.txt) > Add Support for Composite Secondary Indexes > --- > > Key: CASSANDRA-3680 > URL: https://issues.apache.org/jira/browse/CASSANDRA-3680 > Project: Cassandra > Issue Type: Sub-task >Reporter: T Jake Luciani >Assignee: Sylvain Lebresne > Labels: cql3, secondary_index > Fix For: 1.2.0 beta 1 > > Attachments: 0001-Secondary-indexes-on-composite-columns.txt > > > CASSANDRA-2474 and CASSANDRA-3647 add the ability to transpose wide rows > differently, for efficiency and functionality secondary index api needs to be > altered to allow composite indexes. > I think this will require the IndexManager api to have a > maybeIndex(ByteBuffer column) method that SS can call and implement a > PerRowSecondaryIndex per column, break the composite into parts and index > specific bits, also including the base rowkey. > Then a search against a TRANSPOSED row or DOCUMENT will be possible. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7038) Nodetool rebuild_index requires named indexes argument
[ https://issues.apache.org/jira/browse/CASSANDRA-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-7038: --- Attachment: 7038-2.1.txt 7038-1.2.txt > Nodetool rebuild_index requires named indexes argument > -- > > Key: CASSANDRA-7038 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7038 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: Sam Tunnicliffe >Assignee: Sam Tunnicliffe >Priority: Trivial > Attachments: 7038-1.2.txt, 7038-2.1.txt > > > In addition to explicitly listing the indexes to be rebuilt, nodetool > rebuild_indexes will also accept just keyspace & columnfamily arguments, > indicating that all indexes for that ks/cf should be rebuilt. > This doesn't actually work as CFS.rebuildSecondaryIndex requires the explicit > list. In the 2 arg version, nodetool just passes an empty list here and so > the rebuild becomes a no-op. As this has been the case since CASSANDRA-3860 > (AFAICT, 80ea03f is the commit that removed this) we may as well just remove > the option from nodetool, patch attached to do that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-3680) Add Support for Composite Secondary Indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-3680: --- Attachment: 7038-2.1.txt 7038-1.2.txt > Add Support for Composite Secondary Indexes > --- > > Key: CASSANDRA-3680 > URL: https://issues.apache.org/jira/browse/CASSANDRA-3680 > Project: Cassandra > Issue Type: Sub-task >Reporter: T Jake Luciani >Assignee: Sylvain Lebresne > Labels: cql3, secondary_index > Fix For: 1.2.0 beta 1 > > Attachments: 0001-Secondary-indexes-on-composite-columns.txt > > > CASSANDRA-2474 and CASSANDRA-3647 add the ability to transpose wide rows > differently, for efficiency and functionality secondary index api needs to be > altered to allow composite indexes. > I think this will require the IndexManager api to have a > maybeIndex(ByteBuffer column) method that SS can call and implement a > PerRowSecondaryIndex per column, break the composite into parts and index > specific bits, also including the base rowkey. > Then a search against a TRANSPOSED row or DOCUMENT will be possible. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7038) Nodetool rebuild_index requires named indexes argument
Sam Tunnicliffe created CASSANDRA-7038: -- Summary: Nodetool rebuild_index requires named indexes argument Key: CASSANDRA-7038 URL: https://issues.apache.org/jira/browse/CASSANDRA-7038 Project: Cassandra Issue Type: Bug Components: Tools Reporter: Sam Tunnicliffe Assignee: Sam Tunnicliffe Priority: Trivial In addition to explicitly listing the indexes to be rebuilt, nodetool rebuild_indexes will also accept just keyspace & columnfamily arguments, indicating that all indexes for that ks/cf should be rebuilt. This doesn't actually work as CFS.rebuildSecondaryIndex requires the explicit list. In the 2 arg version, nodetool just passes an empty list here and so the rebuild becomes a no-op. As this has been the case since CASSANDRA-3860 (AFAICT, 80ea03f is the commit that removed this) we may as well just remove the option from nodetool, patch attached to do that. -- This message was sent by Atlassian JIRA (v6.2#6252)
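The no-op behaviour described above is easy to see in miniature: with an empty index list, the rebuild loop simply has nothing to iterate. An illustrative sketch, not the actual CFS.rebuildSecondaryIndex code:

```java
import java.util.Set;

// Passing an empty set of index names makes the rebuild a silent no-op:
// the loop body never executes, so no index is ever rebuilt.
class RebuildIndexes
{
    static int rebuild(Set<String> requestedIndexes)
    {
        int rebuilt = 0;
        for (String index : requestedIndexes) // empty set => zero iterations
            rebuilt++; // stand-in for the per-index rebuild work
        return rebuilt;
    }
}
```

This is why the two-argument nodetool form appeared to succeed while doing nothing, and why removing the option is the simplest fix.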
[jira] [Reopened] (CASSANDRA-7030) Remove JEMallocAllocator
[ https://issues.apache.org/jira/browse/CASSANDRA-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict reopened CASSANDRA-7030:
---------------------------------

I think there are actually a couple of questions we should answer before closing the ticket:
1) Without JNI, should we be supporting jemalloc at all? It is slower and has higher overheads in all comparable workloads we can test.
2) Should we be synchronising on malloc/free for jemalloc? Or do we simply hope the user has compiled jemalloc in a manner that avoids the issue?

> Remove JEMallocAllocator
> ------------------------
>
>                 Key: CASSANDRA-7030
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7030
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1 beta2
>
>         Attachments: 7030.txt
>
>
> JEMalloc, whilst having some nice performance properties by comparison to Doug Lea's standard malloc algorithm in principle, is pointless in practice because of the JNA cost. In general it is around 30x more expensive to call than unsafe.allocate(); malloc does not have a variability of response time as extreme as the JNA overhead, so using JEMalloc in Cassandra is never a sensible idea. I doubt custom JNI would make it worthwhile either.
> I propose removing it.
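Question 2 above, serialising all malloc/free calls, can be sketched as a lock-holding wrapper around an allocator interface. `RawAllocator` here is a hypothetical stand-in for the JNA jemalloc binding, not Cassandra's actual IAllocator API:

```java
// Hypothetical minimal allocator interface standing in for the
// JNA jemalloc binding (not Cassandra's real IAllocator API).
interface RawAllocator {
    long allocate(long size);
    void free(long peer);
}

// Serialises every malloc/free through a single monitor: the
// "synchronized jemalloc" configuration measured in the comment
// below. It sidesteps thread-safety concerns in the underlying
// library at the cost of contention on the allocation path.
class SynchronizedAllocator implements RawAllocator {
    private final RawAllocator delegate;

    SynchronizedAllocator(RawAllocator delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized long allocate(long size) {
        return delegate.allocate(size);
    }

    @Override
    public synchronized void free(long peer) {
        delegate.free(peer);
    }
}
```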
[jira] [Commented] (CASSANDRA-7030) Remove JEMallocAllocator
[ https://issues.apache.org/jira/browse/CASSANDRA-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969377#comment-13969377 ]

Benedict commented on CASSANDRA-7030:
-------------------------------------

FTR, though, I think the problem with your test is that jemalloc is synchronised and malloc is not. This leads to the CLHM not obeying its limits as readily as it is asked to (it seems to keep ~3x as much data around in my test):

{noformat}
concurrent malloc:
Elapsed: 55.433s
Allocated: 2973Mb
VM total: 177
vsz: 6221
rsz: 4501

synchronized malloc:
Elapsed: 96.507s
Allocated: 1026Mb
VM total: 187
vsz: 3341
rsz: 1681

synchronized jemalloc:
Elapsed: 263.686s
Allocated: 1027Mb
VM total: 192
vsz: 3628
rsz: 1525
{noformat}

and for posterity, the code I was running:

{code}
public static void main(String[] args) throws InterruptedException, IOException
{
    String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
    final IAllocator allocator = new NativeAllocator();
    final AtomicLong total = new AtomicLong();
    EvictionListener<UUID, Memory> listener = new EvictionListener<UUID, Memory>()
    {
        public void onEviction(UUID k, Memory mem)
        {
            total.addAndGet(-mem.size());
            mem.free(allocator);
        }
    };
    final Map<UUID, Memory> map = new ConcurrentLinkedHashMap.Builder<UUID, Memory>()
            .weigher(Weighers.<Memory>singleton())
            .initialCapacity(8 * 65536)
            .maximumWeightedCapacity(2 * 65536)
            .listener(listener)
            .build();
    final AtomicLong elapsed = new AtomicLong();
    final AtomicLong count = new AtomicLong();
    final ExecutorService exec = Executors.newFixedThreadPool(8);
    for (int i = 0 ; i < 8 ; i++)
    {
        final Random rand = new Random(i);
        exec.execute(new Runnable()
        {
            public void run()
            {
                byte[] keyBytes = new byte[16];
                for (int i = 0; i < 1000000; i++)
                {
                    int size = rand.nextInt(128 * 128);
                    if (size <= 0)
                        continue;
                    rand.nextBytes(keyBytes);
                    long start = System.nanoTime();
                    Memory mem = new Memory(allocator, size);
                    elapsed.addAndGet(System.nanoTime() - start);
                    mem.setMemory(0, mem.size(), (byte) 2);
                    Memory r = map.put(UUID.nameUUIDFromBytes(keyBytes), mem);
                    if (r != null)
                        r.free(allocator);
                    total.addAndGet(size);
                    if (count.incrementAndGet() % 1000000 == 0)
                        System.out.println("1M");
                }
            }
        });
    }
    exec.shutdown();
    exec.awaitTermination(1L, TimeUnit.HOURS);
    System.out.println(String.format("Elapsed: %.3fs", elapsed.get() / 1e9)); // ns -> s
    System.out.println(String.format("Allocated: %.0fMb", total.get() / (double) (1 << 20)));
    System.out.println(String.format("VM total: %.0f", Runtime.getRuntime().totalMemory() / (double) (1 << 20)));
    memuse("vsz", pid);
    memuse("rsz", pid);
    Thread.sleep(100);
}

private static void memuse(String type, String pid) throws IOException
{
    Process p = new ProcessBuilder().command("ps", "-o", type, pid).redirectErrorStream(true).start();
    BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
    reader.readLine(); // skip the ps header line
    System.out.println(String.format("%s: %.0f", type, Integer.parseInt(reader.readLine()) / 1024d));
}
{code}
[jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.
[ https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969333#comment-13969333 ]

Marcus Eriksson commented on CASSANDRA-6696:
--------------------------------------------

Pushed a new version to https://github.com/krummas/cassandra/commits/marcuse/6696-3 which:
* adds a nodetool command to rebalance data over disks, so that users can do this whenever they want (like after manually adding sstables to the data directories)
* removes DiskAwareWriter from everything but streams and the rebalancing command
* makes the flush executor an array of executors
* splits ranges based on the total partitioner range and makes this feature vnodes-only
* supports the old way of doing things for non-vnodes setups (and ordered partitioners)

There are still some of my config changes left in, as I bet there will be more comments on this.

> Drive replacement in JBOD can cause data to reappear.
> -----------------------------------------------------
>
>                 Key: CASSANDRA-6696
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: sankalp kohli
>            Assignee: Marcus Eriksson
>             Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new empty one and repair is run.
> This can cause deleted data to come back in some cases. The same is true for corrupt sstables, where we delete the corrupt sstable and run repair.
> Here is an example: say we have 3 nodes A, B and C, with RF=3 and GC grace=10 days.
> row=sankalp col=sankalp was written 20 days back and successfully went to all three nodes.
> Then a delete/tombstone was written successfully for the same row and column 15 days back.
> Since this tombstone is older than gc grace, it got compacted away in nodes A and B, along with the actual data. So there is no trace of this row/column in nodes A and B.
> Now in node C, say the original data is in drive1 and the tombstone is in drive2. Compaction has not yet reclaimed the data and tombstone.
> Drive2 becomes corrupt and is replaced with a new empty drive. Due to the replacement, the tombstone is now gone, and row=sankalp col=sankalp has come back to life.
> Now after replacing the drive we run repair, and this data is propagated to all nodes.
> Note: this is still a problem even if we run repair every gc grace.
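The "splits ranges based on total partitioner range" item in the comment above can be sketched as dividing the Murmur3 token space into one contiguous slice per disk, so a given token always maps to the same data directory. A rough sketch under that assumption (the boundary math is illustrative, not the patch's actual implementation):

```java
import java.math.BigInteger;

public class DiskBoundaries {
    // Upper (inclusive) token bound per disk, splitting the full
    // Murmur3 token range [Long.MIN_VALUE, Long.MAX_VALUE] evenly.
    // BigInteger avoids overflow: the span (2^64 - 1) exceeds long.
    static long[] boundaries(int numDisks) {
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger span = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        long[] bounds = new long[numDisks];
        for (int i = 1; i <= numDisks; i++)
            bounds[i - 1] = min.add(span.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(numDisks)))
                               .longValueExact();
        return bounds;
    }

    // Map a token to its disk; since the slices are fixed, the same
    // token (and hence the same partition) always lands on one disk.
    static int diskFor(long token, long[] bounds) {
        for (int i = 0; i < bounds.length; i++)
            if (token <= bounds[i])
                return i;
        return bounds.length - 1; // unreachable: last bound is MAX_VALUE
    }
}
```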