On Sun, Mar 12, 2017 at 2:42 PM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
> Looking at the costs of cloud instances, it clearly appears the cost of > CPU dictates the overall cost of the instance. Having 2X more cores > increases cost by nearly 2X keeping other things same as can be seen below > as an example: > > (C3 may have slightly better processor but not more than 10-15% peformance > increase) > > Optimising for fewer CPU cycles will invariably reduce costs by a large > factor. On a modern day machine with SSD's where data density on node can > be high more requests can be assumed to be served from single node, things > get CPU bound. Perhaps its because it was invented at a time when SSD's did > not exist. If we observe closely, many of cassandra defaults are assuming > disk is rotational - number of flush writers, concurrent compactors, etc. > The design suggest that too (Using Sequential io as far as possible. Infact > thats the underlying philosophy for sequential sstable flushes and > sequential commitlog files to avoid random io). Perhaps if it was designed > currently things may look radically different. > > Comparing an average hard disk - ~200 iops vs ~40K for ssd thats approx > 200 times increase effectively increasing expectation from processor to > serve significantly higher ops per second. > > In order to extract best from a modern day node it may need significant > changes such like below : > https://issues.apache.org/jira/browse/CASSANDRA-10989 > Possibly going forward the number of cores per node is only going to > increase as it has been seen for last 5-6 years. In a way thats suggesting > a significant change in design and possibly thats what scylladb is upto. > > "We found that we need a cpu scheduler which takes into account the > priority of different tasks, such as repair, compaction, streaming, read > operations and write operations." > From my understanding in Cassandra as well compaction threads run on low > nice priority - not sure about repair/streaming. > http://grokbase.com/t/cassandra/user/14a85xpce7/significant-nice-cpu-usage > > Regards, > > On Sun, Mar 12, 2017 at 2:35 PM, Avi Kivity <a...@scylladb.com> wrote: > >> btw, for an example of how user-level tasks can be scheduled in a way >> that cannot be done with kernel threads, see this pair of blog posts: >> >> >> http://www.scylladb.com/2016/04/14/io-scheduler-1/ >> >> http://www.scylladb.com/2016/04/29/io-scheduler-2/ >> >> >> There's simply no way to get this kind of control when you rely on the >> kernel for scheduling and page cache management. As a result you have to >> overprovision your node and then you mostly underutilize it. >> >> On 03/12/2017 10:23 AM, Avi Kivity wrote: >> >> >> >> On 03/12/2017 12:19 AM, Kant Kodali wrote: >> >> My response is inline. >> >> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity <a...@scylladb.com> wrote: >> >>> There are several issues at play here. >>> >>> First, a database runs a large number of concurrent operations, each of >>> which only consumes a small amount of CPU. The high concurrency is need to >>> hide latency: disk latency, or the latency of contacting a remote node. >>> >> >> *Ok so you are talking about hiding I/O latency. If all these I/O are >> non-blocking system calls then a thread per core and callback mechanism >> should suffice isn't it?* >> >> >> >> Scylla uses a mix of user-level threads and callbacks. Most of the code >> uses callbacks (fronted by a future/promise API). SSTable writers >> (memtable flush, compaction) use a user-level thread (internally >> implemented using callbacks). The important bit is multiplexing many >> concurrent operations onto a single kernel thread. >> >> >> This means that the scheduler will need to switch contexts very often. A >>> kernel thread scheduler knows very little about the application, so it has >>> to switch a lot of context. A user level scheduler is tightly bound to the >>> application, so it can perform the switching faster. >>> >> >> *sure but this applies in other direction as well. A user level scheduler >> has no idea about kernel level scheduler either. There is literally no >> coordination between kernel level scheduler and user level scheduler in >> linux or any major OS. It may be possible with OS's that support scheduler >> activation(LWP's) and upcall mechanism. * >> >> >> There is no need for coordination, because the kernel scheduler has no >> scheduling decisions to make. With one thread per core, bound to its core, >> the kernel scheduler can't make the wrong decision because it has just one >> choice. >> >> >> *Even then it is hard to say if it is all worth it (The research shows >> performance may not outweigh the complexity). Golang problem is exactly >> this if one creates 1000 go routines/green threads where each of them is >> making a blocking system call then it would create 1000 kernel threads >> underneath because it has no way to know that the kernel thread is blocked >> (no upcall). * >> >> >> All of the significant system calls we issue are through the main thread, >> either asynchronous or non-blocking. >> >> *And in non-blocking case I still don't even see a significant >> performance when compared to few kernel threads with callback mechanism.* >> >> >> We do. >> >> >> * If you are saying user level scheduling is the Future (perhaps I would >> just let the researchers argue about it) As of today that is not case else >> languages would have had it natively instead of using third party >> frameworks or libraries. * >> >> >> User-level scheduling is great for high performance I/O intensive >> applications like databases and file systems. It's not a general solution, >> and it involves a lot of effort to set up the infrastructure. However, for >> our use case, it was worth it. >> >> >> >>> There are also implications on the concurrency primitives in use (locks >>> etc.) -- they will be much faster for the user-level scheduler, because >>> they cooperate with the scheduler. For example, no atomic >>> read-modify-write instructions need to be executed. >>> >> >> >> Second, how many (kernel) threads should you run? >> * This question one will always have. If there are 10K user level threads >> that maps to only one kernel thread then they cannot exploit parallelism. >> so there is no right answer but a thread per core is a reasonable/good >> choice. * >> >> >> Only if you can multiplex many operations on top of each of those >> threads. Otherwise, the CPUs end up underutilized. >> >> >> >>> If you run too few threads, then you will not be able to saturate the >>> CPU resources. This is a common problem with Cassandra -- it's very hard >>> to get it to consume all of the CPU power on even a moderately large >>> machine. On the other hand, if you have too many threads, you will see >>> latency rise very quickly, because kernel scheduling granularity is on the >>> order of milliseconds. User-level scheduling, because it leaves control in >>> the hand of the application, allows you to both saturate the CPU and >>> maintain low latency. >>> >> >> F*or my workload and probably others I had seen Cassandra was always >> been CPU bound.* >> >>> >>> >> >> Yes, but does it consume 100% of all of the cores on your machine? >> Cassandra generally doesn't (on a larger machine), and when you profile it, >> you see it spending much of its time in atomic operations, or >> parking/unparking threads -- fighting with itself. It doesn't scale within >> the machine. Scylla will happily utilize all of the cores that it is >> assigned (all of them by default in most configurations), and the bigger >> the machine you give it, the happier it will be. >> >> There are other factors, like NUMA-friendliness, but in the end it all >>> boils down to efficiency and control. >>> >>> None of this is new btw, it's pretty common in the storage world. >>> >>> Avi >>> >>> >>> On 03/11/2017 11:18 PM, Kant Kodali wrote: >>> >>> Here is the Java version http://docs.paralleluniverse.co/quasar/ but I >>> still don't see how user level scheduling can be beneficial (This is a well >>> debated problem)? How can this add to the performance? or say why is user >>> level scheduling necessary Given the Thread per core design and the >>> callback mechanism? >>> >>> On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity <a...@scylladb.com> wrote: >>> >>>> Scylla uses a the seastar framework, which provides for both user-level >>>> thread scheduling and simple run-to-completion tasks. >>>> >>>> Huge pages are limited to 2MB (and 1GB, but these aren't available as >>>> transparent hugepages). >>>> >>>> >>>> On 03/11/2017 10:26 PM, Kant Kodali wrote: >>>> >>>> @Dor >>>> >>>> 1) You guys have a CPU scheduler? you mean user level thread Scheduler >>>> that maps user level threads to kernel level threads? I thought C++ by >>>> default creates native kernel threads but sure nothing will stop someone to >>>> create a user level scheduling library if that's what you are talking >>>> about? >>>> 2) How can one create THP of size 1KB? According to this post >>>> <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html> >>>> it >>>> looks like the valid values 2MB and 1GB. >>>> >>>> Thanks, >>>> kant >>>> >>>> On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity <a...@scylladb.com> wrote: >>>> >>>>> Agreed, I'd recommend to treat benchmarks as a rough guide to see >>>>> where there is potential, and follow through with your own tests. >>>>> >>>>> On 03/11/2017 09:37 PM, Edward Capriolo wrote: >>>>> >>>>> >>>>> Benchmarks are great for FUDly blog posts. Real world work loads >>>>> matter more. Every NoSQL vendor wins their benchmarks. >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>> >>> >> >> >> >