
On Sun, Mar 12, 2017 at 2:42 PM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Looking at the costs of cloud instances, it clearly appears that the cost
> of CPU dictates the overall cost of the instance. Having 2X more cores
> increases the cost by nearly 2X, other things being equal, as can be seen
> below as an example:
>
> (C3 may have a slightly better processor, but not more than a 10-15%
> performance increase)
>
> Optimising for fewer CPU cycles will invariably reduce costs by a large
> factor. On a modern machine with SSDs, where data density per node can be
> high and more requests can be served from a single node, things get CPU
> bound. Perhaps that's because Cassandra was invented at a time when SSDs
> did not exist. If we observe closely, many of Cassandra's defaults assume
> the disk is rotational - number of flush writers, concurrent compactors,
> etc. The design suggests that too (using sequential I/O as far as
> possible; in fact that's the underlying philosophy behind sequential
> SSTable flushes and sequential commitlog files, to avoid random I/O).
> Perhaps if it were designed today, things might look radically different.
>
> Comparing an average hard disk (~200 IOPS) with an SSD (~40K IOPS), that's
> roughly a 200x increase, effectively raising the expectation on the
> processor to serve significantly more ops per second.
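[Editor's aside: the 200x figure above can be sanity-checked with a trivial sketch; the IOPS numbers are the rough ones quoted in the email, not measurements.]

```python
# Rough throughput expectations, using the figures quoted above.
hdd_iops = 200       # typical rotational disk, random I/O
ssd_iops = 40_000    # modest SATA SSD; NVMe drives go far higher
ratio = ssd_iops // hdd_iops
print(f"An SSD asks the CPU to drive roughly {ratio}x more I/O ops per second")
```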
>
> In order to extract the best from a modern node, it may need significant
> changes like the one below:
> https://issues.apache.org/jira/browse/CASSANDRA-10989
> Going forward, the number of cores per node is only going to increase, as
> it has for the last 5-6 years. In a way that suggests a significant change
> in design, and possibly that's what ScyllaDB is up to.
>
> "We found that we need a cpu scheduler which takes into account the
> priority of different tasks, such as repair, compaction, streaming, read
> operations and write operations."
> From my understanding, in Cassandra compaction threads also run at a low
> nice priority - I am not sure about repair/streaming.
> http://grokbase.com/t/cassandra/user/14a85xpce7/significant-nice-cpu-usage
>
> Regards,
>
> On Sun, Mar 12, 2017 at 2:35 PM, Avi Kivity <a...@scylladb.com> wrote:
>
>> btw, for an example of how user-level tasks can be scheduled in a way
>> that cannot be done with kernel threads, see this pair of blog posts:
>>
>>
>>   http://www.scylladb.com/2016/04/14/io-scheduler-1/
>>
>>   http://www.scylladb.com/2016/04/29/io-scheduler-2/
>>
>>
>> There's simply no way to get this kind of control when you rely on the
>> kernel for scheduling and page cache management.  As a result you have to
>> overprovision your node and then you mostly underutilize it.
>>
>> On 03/12/2017 10:23 AM, Avi Kivity wrote:
>>
>>
>>
>> On 03/12/2017 12:19 AM, Kant Kodali wrote:
>>
>> My response is inline.
>>
>> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity <a...@scylladb.com> wrote:
>>
>>> There are several issues at play here.
>>>
>>> First, a database runs a large number of concurrent operations, each of
>>> which consumes only a small amount of CPU. The high concurrency is needed
>>> to hide latency: disk latency, or the latency of contacting a remote node.
>>>
>>
>> *Ok, so you are talking about hiding I/O latency. If all these I/Os are
>> non-blocking system calls, then a thread per core and a callback mechanism
>> should suffice, shouldn't it?*
>>
>>
>>
>> Scylla uses a mix of user-level threads and callbacks. Most of the code
>> uses callbacks (fronted by a future/promise API). SSTable writers
>> (memtable flush, compaction) use a user-level thread (internally
>> implemented using callbacks).  The important bit is multiplexing many
>> concurrent operations onto a single kernel thread.
>>
>>
>>> This means that the scheduler will need to switch contexts very often. A
>>> kernel thread scheduler knows very little about the application, so it
>>> has to save and restore a lot of context on each switch. A user-level
>>> scheduler is tightly bound to the application, so it can switch faster.
>>>
>>
>> *Sure, but this applies in the other direction as well. A user-level
>> scheduler has no idea about the kernel-level scheduler either. There is
>> literally no coordination between the kernel-level scheduler and the
>> user-level scheduler in Linux or any major OS. It may be possible with
>> OSes that support scheduler activations (LWPs) and an upcall mechanism.*
>>
>>
>> There is no need for coordination, because the kernel scheduler has no
>> scheduling decisions to make.  With one thread per core, bound to its core,
>> the kernel scheduler can't make the wrong decision because it has just one
>> choice.
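[Editor's aside: a minimal sketch of the shard-per-core idea in Python. Illustrative only, not Scylla's code; `os.sched_setaffinity` is Linux-specific, hence the guards.]

```python
import os
import threading

def shard_main(core):
    # Pin this kernel thread to one core: the kernel scheduler is left
    # with exactly one placement choice, so it cannot decide wrongly.
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        os.sched_setaffinity(0, {core})
    # A real shard would now run its event loop here, touching only
    # data owned by this core (no cross-core locking needed).

# One kernel thread per core the process is allowed to run on.
if hasattr(os, "sched_getaffinity"):
    cores = sorted(os.sched_getaffinity(0))
else:
    cores = list(range(os.cpu_count()))
threads = [threading.Thread(target=shard_main, args=(c,)) for c in cores]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"started one pinned thread per core: {len(threads)}")
```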
>>
>>
>> *Even then it is hard to say if it is all worth it (the research shows the
>> performance may not outweigh the complexity). Golang's problem is exactly
>> this: if one creates 1000 goroutines/green threads where each of them is
>> making a blocking system call, then it would create 1000 kernel threads
>> underneath, because it has no way to know that the kernel thread is
>> blocked (no upcall).*
>>
>>
>> All of the significant system calls we issue are through the main thread,
>> either asynchronous or non-blocking.
>>
>> *And in the non-blocking case I still don't see a significant performance
>> difference when compared to a few kernel threads with a callback
>> mechanism.*
>>
>>
>> We do.
>>
>>
>> *If you are saying user-level scheduling is the future (perhaps I would
>> just let the researchers argue about it), as of today that is not the
>> case, else languages would have it natively instead of relying on
>> third-party frameworks or libraries.*
>>
>>
>> User-level scheduling is great for high performance I/O intensive
>> applications like databases and file systems.  It's not a general solution,
>> and it involves a lot of effort to set up the infrastructure. However, for
>> our use case, it was worth it.
>>
>>
>>
>>> There are also implications on the concurrency primitives in use (locks
>>> etc.) -- they will be much faster for the user-level scheduler, because
>>> they cooperate with the scheduler.  For example, no atomic
>>> read-modify-write instructions need to be executed.
>>>
>>
>>
>>      Second, how many (kernel) threads should you run?
>> *This question will always come up. If there are 10K user-level threads
>> that map to only one kernel thread, then they cannot exploit parallelism,
>> so there is no single right answer, but a thread per core is a
>> reasonable/good choice.*
>>
>>
>> Only if you can multiplex many operations on top of each of those
>> threads.  Otherwise, the CPUs end up underutilized.
>>
>>
>>
>>> If you run too few threads, then you will not be able to saturate the
>>> CPU resources.  This is a common problem with Cassandra -- it's very hard
>>> to get it to consume all of the CPU power on even a moderately large
>>> machine. On the other hand, if you have too many threads, you will see
>>> latency rise very quickly, because kernel scheduling granularity is on the
>>> order of milliseconds.  User-level scheduling, because it leaves control in
>>> the hand of the application, allows you to both saturate the CPU and
>>> maintain low latency.
>>>
>>
>> *For my workload, and probably others I have seen, Cassandra was always
>> CPU bound.*
>>
>>>
>>>
>>
>> Yes, but does it consume 100% of all of the cores on your machine?
>> Cassandra generally doesn't (on a larger machine), and when you profile it,
>> you see it spending much of its time in atomic operations, or
>> parking/unparking threads -- fighting with itself.  It doesn't scale within
>> the machine.  Scylla will happily utilize all of the cores that it is
>> assigned (all of them by default in most configurations), and the bigger
>> the machine you give it, the happier it will be.
>>
>> There are other factors, like NUMA-friendliness, but in the end it all
>>> boils down to efficiency and control.
>>>
>>> None of this is new btw, it's pretty common in the storage world.
>>>
>>> Avi
>>>
>>>
>>> On 03/11/2017 11:18 PM, Kant Kodali wrote:
>>>
>>> Here is the Java version http://docs.paralleluniverse.co/quasar/ but I
>>> still don't see how user-level scheduling can be beneficial (this is a
>>> well-debated problem). How can this add to the performance? Or rather,
>>> why is user-level scheduling necessary, given the thread-per-core design
>>> and the callback mechanism?
>>>
>>> On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity <a...@scylladb.com> wrote:
>>>
>>>> Scylla uses the Seastar framework, which provides both user-level
>>>> thread scheduling and simple run-to-completion tasks.
>>>>
>>>> Huge pages are limited to 2MB (and 1GB, but these aren't available as
>>>> transparent hugepages).
>>>>
>>>>
>>>> On 03/11/2017 10:26 PM, Kant Kodali wrote:
>>>>
>>>> @Dor
>>>>
>>>> 1) You guys have a CPU scheduler? You mean a user-level thread
>>>> scheduler that maps user-level threads to kernel-level threads? I
>>>> thought C++ by default creates native kernel threads, but sure, nothing
>>>> stops someone from creating a user-level scheduling library, if that's
>>>> what you are talking about?
>>>> 2) How can one create a THP of size 1KB? According to this post
>>>> <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html>
>>>> it looks like the valid values are 2MB and 1GB.
>>>>
>>>> Thanks,
>>>> kant
>>>>
>>>> On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity <a...@scylladb.com> wrote:
>>>>
>>>>> Agreed; I'd recommend treating benchmarks as a rough guide to see
>>>>> where there is potential, and following through with your own tests.
>>>>>
>>>>> On 03/11/2017 09:37 PM, Edward Capriolo wrote:
>>>>>
>>>>>
>>>>> Benchmarks are great for FUDly blog posts. Real-world workloads
>>>>> matter more. Every NoSQL vendor wins their own benchmarks.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
